Multi-language speech recognition system

ABSTRACT

A speech recognition system includes distributed processing across a client and server for recognizing a spoken query by a user. A number of different speech models for different natural languages are used to support and detect a natural language spoken by a user. In some implementations an interactive electronic agent responds in the user's native language to facilitate a real-time, human-like dialogue.

RELATED APPLICATIONS

The present application claims priority to and is a continuation of Ser. No. 10/684,357 filed Oct. 10, 2003, which in turn is a continuation of Ser. No. 09/439,145 filed Nov. 12, 1999 (now U.S. Pat. No. 6,633,846). Both applications are hereby incorporated by reference herein.

FIELD OF THE INVENTION

The invention relates to a system and an interactive method for responding to speech based user inputs and queries presented over a distributed network such as the INTERNET or local intranet. This interactive system, when implemented over the World-Wide Web (WWW) services of the INTERNET, functions so that a client or user can ask a question in a natural language such as English, French, German, Spanish or Japanese and receive the appropriate answer at his or her computer or accessory also in his or her native natural language. The system has particular applicability to such applications as remote learning, e-commerce, technical e-support services, INTERNET searching, etc.

BACKGROUND OF THE INVENTION

The INTERNET, and in particular, the World-Wide Web (WWW), is growing in popularity and usage for both commercial and recreational purposes, and this trend is expected to continue. This phenomenon is being driven, in part, by the increasing and widespread use of personal computer systems and the availability of low cost INTERNET access. The emergence of inexpensive INTERNET access devices and high speed access techniques such as ADSL, cable modems, satellite modems, and the like, is expected to further accelerate the mass usage of the WWW.

Accordingly, it is expected that the number of entities offering services, products, etc., over the WWW will increase dramatically over the coming years. Until now, however, the INTERNET “experience” for users has been limited mostly to non-voice based input/output devices, such as keyboards, intelligent electronic pads, mice, trackballs, printers, monitors, etc. This presents somewhat of a bottleneck for interacting over the WWW for a variety of reasons.

First, there is the issue of familiarity. Many kinds of applications lend themselves much more naturally and fluently to a voice-based environment. For instance, most people shopping for audio recordings are very comfortable with asking a live sales clerk in a record store for information on titles by a particular author, where they can be found in the store, etc. While it is often possible to browse and search on one's own to locate items of interest, it is usually easier and more efficient to get some form of human assistance first, and, with few exceptions, this request for assistance is presented in the form of an oral query. In addition, many persons cannot or will not, because of physical or psychological barriers, use any of the aforementioned conventional I/O devices. For example, many older persons cannot easily read the text presented on WWW pages, or understand the layout/hierarchy of menus, or manipulate a mouse to make finely coordinated movements to indicate their selections. Many others are intimidated by the look and complexity of computer systems, WWW pages, etc., and therefore do not attempt to use online services for this reason as well.

Thus, applications which can mimic normal human interactions are likely to be preferred by potential on-line shoppers and persons looking for information over the WWW. It is also expected that the use of voice-based systems will increase the universe of persons willing to engage in e-commerce, e-learning, etc. To date, however, there are very few systems, if any, which permit this type of interaction, and, if they do, it is very limited. For example, various commercial programs sold by IBM (VIAVOICE™) and Kurzweil (DRAGON™) permit some user control of the interface (opening, closing files) and searching (by using previously trained URLs) but they do not present a flexible solution that can be used by a number of users across multiple cultures and without time-consuming voice training. Typical prior efforts to implement voice based functionality in an INTERNET context can be seen in U.S. Pat. No. 5,819,220 incorporated by reference herein.

Another issue presented by the lack of voice-based systems is efficiency. Many companies are now offering technical support over the INTERNET, and some even offer live operator assistance for such queries. While this is very advantageous (for the reasons mentioned above) it is also extremely costly and inefficient, because a real person must be employed to handle such queries. This presents a practical limit that results in long wait times for responses or high labor overheads. An example of this approach can be seen in U.S. Pat. No. 5,802,526 also incorporated by reference herein. In general, a service presented over the WWW is far more desirable if it is “scalable,” or, in other words, able to handle an increasing amount of user traffic with little if any perceived delay or troubles by a prospective user.

In a similar context, while remote learning has become an increasingly popular option for many students, it is practically impossible for an instructor to be able to field questions from more than one person at a time. Even then, such interaction usually takes place for only a limited period of time because of other instructor time constraints. To date, however, there is no practical way for students to continue a human-like question and answer type dialog after the learning session is over, or without the presence of the instructor to personally address such queries.

Conversely, another aspect of emulating a human-like dialog involves the use of oral feedback. In other words, many persons prefer to receive answers and information in audible form. While a form of this functionality is used by some websites to communicate information to visitors, it is not performed in a real-time, interactive question-answer dialog fashion so its effectiveness and usefulness are limited.

Yet another area that could benefit from speech-based interaction involves so-called “search” engines used by INTERNET users to locate information of interest at web sites, such as those available at YAHOO®.com, METACRAWLER®.com, EXCITE®.com, etc. These tools permit the user to form a search query using either combinations of keywords or metacategories to search through a web page database containing text indices associated with one or more distinct web pages. After processing the user's request, therefore, the search engine returns a number of hits which correspond, generally, to URL pointers and text excerpts from the web pages that represent the closest match made by such search engine for the particular user query based on the search processing logic used by the search engine. The structure and operation of such prior art search engines, including the mechanism by which they build the web page database, and parse the search query, are well known in the art. To date, applicant is unaware of any such search engine that can easily and reliably search and retrieve information based on speech input from a user.

There are a number of reasons why the above environments (e-commerce, e-support, remote learning, INTERNET searching, etc.) do not utilize speech-based interfaces, despite the many benefits that would otherwise flow from such capability. First, there is obviously a requirement that the output of the speech recognizer be as accurate as possible. One of the more reliable approaches to speech recognition used at this time is based on the Hidden Markov Model (HMM), a model used to mathematically describe any time series. A conventional usage of this technique is disclosed, for example, in U.S. Pat. No. 4,587,670 incorporated by reference herein. Because speech is considered to have an underlying sequence of one or more symbols, the HMM models corresponding to each symbol are trained on vectors from the speech waveforms. The Hidden Markov Model is a finite set of states, each of which is associated with a (generally multi-dimensional) probability distribution. Transitions among the states are governed by a set of probabilities called transition probabilities. In a particular state an outcome or observation can be generated, according to the associated probability distribution. This finite state machine changes state once every time unit, and each time t such that a state j is entered, a spectral parameter vector O_t is generated with probability density B_j(O_t). Only the outcome, not the state, is visible to an external observer, and therefore states are “hidden” to the outside; hence the name Hidden Markov Model. The basic theory of HMMs was published in a series of classic papers by Baum and his colleagues in the late 1960's and early 1970's. HMMs were first used in speech applications by Baker at Carnegie Mellon, by Jelinek and colleagues at IBM in the late 1970's and by Steve Young and colleagues at Cambridge University, UK in the 1990's. Some typical papers and texts are as follows:

-   1. L. E. Baum, T. Petrie, “Statistical inference for probabilistic functions for finite state Markov chains”, Ann. Math. Stat., 37: 1554-1563, 1966

-   2. L. E. Baum, “An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes”, Inequalities 3: 1-8, 1972

-   3. J. H. Baker, “The dragon system—An Overview”, IEEE Trans. on ASSP Proc., ASSP-23(1): 24-29, Feb. 1975

-   4. F. Jelinek et al, “Continuous Speech Recognition: Statistical methods” in Handbook of Statistics, II, P. R. Krishnaiah, Ed. Amsterdam, The Netherlands, North-Holland, 1982

-   5. L. R. Bahl, F. Jelinek, R. L. Mercer, “A maximum likelihood approach to continuous speech recognition”, IEEE Trans. Pattern Anal. Mach. Intell., PAMI-5: 179-190, 1983

-   6. J. D. Ferguson, “Hidden Markov Analysis: An Introduction”, in Hidden Markov Models for Speech, Institute of Defense Analyses, Princeton, N.J. 1980.

-   7. L. R. Rabiner and B. H. Juang, “Fundamentals of Speech Recognition”, Prentice Hall, 1993

-   8. L. R. Rabiner, “Digital Processing of Speech Signals”, Prentice Hall, 1978

More recently, research has progressed in extending HMMs and combining HMMs with neural networks for speech recognition applications at various laboratories. The following is a representative paper:

-   9. Nelson Morgan, Hervé Bourlard, Steve Renals, Michael Cohen and Horacio Franco (1993), Hybrid Neural Network/Hidden Markov Model Systems for Continuous Speech Recognition. Journal of Pattern Recognition and Artificial Intelligence, Vol. 7, No. 4 pp. 899-916. Also in I. Guyon and P. Wang editors, Advances in Pattern Recognition Systems using Neural Networks, Vol. 7 of a Series in Machine Perception and Artificial Intelligence. World Scientific, February 1994.

All of the above are hereby incorporated by reference. While the HMM-based speech recognition yields very good results, contemporary variations of this technique cannot guarantee a word accuracy requirement of 100% exactly and consistently, as will be required for WWW applications for all possible user and environment conditions. Thus, although speech recognition technology has been available for several years, and has improved significantly, the technical requirements have placed severe restrictions on the specifications for the speech recognition accuracy that is required for an application that combines speech recognition and natural language processing to work satisfactorily.
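By way of illustration only, the hidden Markov model described in the background above can be sketched as follows. This is a minimal sketch under simplifying assumptions: the emissions are discrete symbols rather than the continuous spectral parameter vectors discussed above, and the two-state model, its probabilities and the function name `forward_likelihood` are hypothetical values chosen for the example, not taken from any cited reference. The sketch uses the standard forward recursion to compute the probability of an observation sequence while the state sequence stays hidden.

```python
def forward_likelihood(initial, transition, emission, observations):
    """Return P(observations) for a discrete-emission HMM (forward algorithm).

    initial[j]       -- probability of starting in state j
    transition[i][j] -- probability of moving from state i to state j
    emission[j][o]   -- probability that state j emits symbol o
    """
    n_states = len(initial)
    # alpha[j]: probability of the observations seen so far, ending in state j
    alpha = [initial[j] * emission[j][observations[0]] for j in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * transition[i][j] for i in range(n_states))
            * emission[j][obs]
            for j in range(n_states)
        ]
    # Summing over the final (hidden) state gives the sequence likelihood
    return sum(alpha)


# Hypothetical two-state model with two observation symbols (0 and 1)
initial = [0.6, 0.4]
transition = [[0.7, 0.3], [0.4, 0.6]]
emission = [[0.9, 0.1], [0.2, 0.8]]

likelihood = forward_likelihood(initial, transition, emission, [0, 1, 0])
print(likelihood)
```

In a recognizer of the general kind discussed above, one such model would typically be trained per symbol, and the model giving the highest likelihood for an observed sequence would be preferred.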

In contrast to word recognition, Natural language processing (NLP) is concerned with the parsing, understanding and indexing of transcribed utterances and larger linguistic units. Spontaneous speech contains many surface phenomena, such as disfluencies (hesitations, repairs and restarts), discourse markers such as ‘well’ and other elements which cannot be handled by the typical speech recognizer; this is the problem and the source of the large gap that separates speech recognition and natural language processing technologies. Except for silence between utterances, another problem is the absence of any marked punctuation available for segmenting the speech input into meaningful units such as utterances. For optimal NLP performance, these types of phenomena should be annotated at its input. However, most continuous speech recognition systems produce only a raw sequence of words. Examples of conventional systems using NLP are shown in U.S. Pat. Nos. 4,991,094, 5,068,789, 5,146,405 and 5,680,628, all of which are incorporated by reference herein.

Second, most of the very reliable voice recognition systems are speaker-dependent, requiring that the interface be “trained” with the user's voice, which takes a lot of time, and is thus very undesirable from the perspective of a WWW environment, where a user may interact only a few times with a particular website. Furthermore, speaker-dependent systems usually require a large user dictionary (one for each unique user) which reduces the speed of recognition. This makes it much harder to implement a real-time dialog interface with satisfactory response capability (i.e., something that mirrors normal conversation; on the order of 3-5 seconds is probably ideal). At present, the typical shrink-wrapped speech recognition application software includes offerings from IBM (VIAVOICE™) and Dragon Systems (DRAGON™). While most of these applications are adequate for dictation and other transcribing applications, they are woefully inadequate for applications such as NLQS where the word error rate must be close to 0%. In addition these offerings require long training times and are typically non-client-server configurations. Other types of trained systems are discussed in U.S. Pat. No. 5,231,670 assigned to Kurzweil, and which is also incorporated by reference herein.

Another significant problem faced in a distributed voice-based system is a lack of uniformity/control in the speech recognition process. In a typical stand-alone implementation of a speech recognition system, the entire SR engine runs on a single client. A well-known system of this type is depicted in U.S. Pat. No. 4,991,217 incorporated by reference herein. These clients can take numerous forms (desktop PC, laptop PC, PDA, etc.) having varying speech signal processing and communications capability. Thus, from the server side perspective, it is not easy to assure uniform treatment of all users accessing a voice-enabled web page, since such users may have significantly disparate word recognition and error rate performances. While a prior art reference to Gould et al. (U.S. Pat. No. 5,915,236) discusses generally the notion of tailoring a recognition process to a set of available computational resources, it does not address or attempt to solve the issue of how to optimize resources in a distributed environment such as a client-server model. Again, to enable such voice-based technologies on a wide-spread scale it is far more preferable to have a system that harmonizes and accounts for discrepancies in individual systems so that even the thinnest client is supportable, and so that all users are able to interact in a satisfactory manner with the remote server running the e-commerce, e-support and/or remote learning application.

Two references that refer to a distributed approach for speech recognition include U.S. Pat. Nos. 5,956,683 and 5,960,399 incorporated by reference herein. In the first of these, U.S. Pat. No. 5,956,683, Distributed Voice Recognition System (assigned to Qualcomm), an implementation of a distributed voice recognition system between a telephony-based handset and a remote station is described. In this implementation, all of the word recognition operations seem to take place at the handset. This is done since the patent describes the benefits that result from locating the system for acoustic feature extraction at the portable or cellular phone in order to limit degradation of the acoustic features due to quantization distortion resulting from the narrow bandwidth telephony channel. This reference therefore does not address the issue of how to ensure adequate performance for a very thin client platform. Moreover, it is difficult to determine how, if at all, the system can perform real-time word recognition, and there is no meaningful description of how to integrate the system with a natural language processor.

The second of these references, U.S. Pat. No. 5,960,399, Client/Server Speech Processor/Recognizer (assigned to GTE), describes the implementation of an HMM-based distributed speech recognition system. This reference is not instructive in many respects, however, including how to optimize acoustic feature extraction for a variety of client platforms, such as by performing a partial word recognition process where appropriate. Most importantly, there is only a description of a primitive server-based recognizer that only recognizes the user's speech and simply returns certain keywords such as the user's name and travel destination to fill out a dedicated form on the user's machine. Also, the streaming of the acoustic parameters does not appear to be implemented in real-time as it can only take place after silence is detected. Finally, while the reference mentions the possible use of natural language processing (column 9) there is no explanation of how such function might be implemented in a real-time fashion to provide an interactive feel for the user.

SUMMARY OF THE INVENTION

An object of the present invention, therefore, is to provide an improved system and method for overcoming the limitations of the prior art noted above;

A primary object of the present invention is to provide a word and phrase recognition system that is flexibly and optimally distributed across a client/platform computing architecture, so that improved accuracy, speed and uniformity can be achieved for a wide group of users;

A further object of the present invention is to provide a speech recognition system that efficiently integrates a distributed word recognition system with a natural language processing system, so that both individual words and entire speech utterances can be quickly and accurately recognized in any number of possible languages;

A related object of the present invention is to provide an efficient query response system so that an extremely accurate, real-time set of appropriate answers can be given in response to speech-based queries;

Yet another object of the present invention is to provide an interactive, real-time instructional/learning system that is distributed across a client/server architecture, and permits a real-time question/answer session with an interactive character;

A related object of the present invention is to implement such interactive character with an articulated response capability so that the user experiences a human-like interaction;

Still a further object of the present invention is to provide an INTERNET website with speech processing capability so that voice based data and commands can be used to interact with such site, thus enabling voice-based e-commerce and e-support services to be easily scaleable;

Another object is to implement a distributed speech recognition system that utilizes environmental variables as part of the recognition process to improve accuracy and speed;

A further object is to provide a scaleable query/response database system, to support any number of query topics and users as needed for a particular application and instantaneous demand;

Yet another object of the present invention is to provide a query recognition system that employs a two-step approach, including a relatively rapid first step to narrow down the list of potential responses to a smaller candidate set, and a more computationally intensive second step to identify the best choice to be returned in response to the query from the candidate set;

A further object of the present invention is to provide a natural language processing system that facilitates query recognition by extracting lexical components of speech utterances, which components can be used for rapidly identifying a candidate set of potential responses appropriate for such speech utterances;

Another related object of the present invention is to provide a natural language processing system that facilitates query recognition by comparing lexical components of speech utterances with a candidate set of potential responses to provide an extremely accurate best response to such query.

One general aspect of the present invention, therefore, relates to a natural language query system (NLQS) that offers a fully interactive method for answering a user's questions over a distributed network such as the INTERNET or a local intranet. This interactive system, when implemented over the worldwide web (WWW) services of the INTERNET, functions so that a client or user can ask a question in a natural language such as English, French, German or Spanish and receive the appropriate answer at his or her personal computer also in his or her native natural language.

The system is distributed and consists of a set of integrated software modules at the client's machine and another set of integrated software programs resident on a server or set of servers. The client-side software program is comprised of a speech recognition program, an agent and its control program, and a communication program. The server-side program is comprised of a communication program, a natural language engine (NLE), a database processor (DBProcess), an interface program for interfacing the DBProcess with the NLE, and a SQL database. In addition, the client's machine is equipped with a microphone and a speaker. Processing of the speech utterance is divided between the client and server side so as to optimize processing and transmission latencies, and so as to provide support for even very thin client platforms.

In the context of an interactive learning application, the system is specifically used to provide a single-best answer to a user's question. The question that is asked at the client's machine is articulated by the speaker and captured by a microphone that is built in as in the case of a notebook computer or is supplied as a standard peripheral attachment. Once the question is captured, the question is processed partially by NLQS client-side software resident in the client's machine. The output of this partial processing is a set of speech vectors that are transported to the server via the INTERNET to complete the recognition of the user's questions. This recognized speech is then converted to text at the server.

After the user's question is decoded by the speech recognition engine (SRE) located at the server, the question is converted to a structured query language (SQL) query. This query is then simultaneously presented to a software process within the server called DBProcess for preliminary processing and to a Natural Language Engine (NLE) module for extracting the noun phrases (NP) of the user's question. During the process of extracting the noun phrase within the NLE, the tokens of the user's question are tagged. The tagged tokens are then grouped so that the NP list can be determined. This information is stored and sent to the DBProcess process.
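The tagging and grouping steps just described can be sketched, purely as an illustration, in the following manner. The small lexicon, the tag names and the function names are hypothetical and stand in for whatever tagger the NLE actually uses; the grouping rule (a run of adjectives and nouns containing at least one noun) is likewise an assumption made for the example.

```python
# Hypothetical toy lexicon mapping words to part-of-speech tags.
# A real tagger would rely on a trained model or a full dictionary.
LEXICON = {
    "what": "WH", "is": "VERB", "the": "DET", "this": "DET",
    "of": "PREP", "price": "NOUN", "compact": "ADJ", "disk": "NOUN",
}


def tokenize(utterance):
    """Split a recognized utterance into lowercase word tokens."""
    cleaned = "".join(
        ch if ch.isalnum() or ch == "'" else " " for ch in utterance.lower()
    )
    return cleaned.split()


def tag(tokens):
    """Attach a part-of-speech tag to each token; unknown words default to NOUN."""
    return [(token, LEXICON.get(token, "NOUN")) for token in tokens]


def group_noun_phrases(tagged_tokens):
    """Group runs of adjectives and nouns that contain at least one noun."""
    phrases, run = [], []
    # The sentinel entry flushes the final run at the end of the utterance
    for token, pos in tagged_tokens + [("", "END")]:
        if pos in ("ADJ", "NOUN"):
            run.append((token, pos))
            continue
        if any(p == "NOUN" for _, p in run):
            phrases.append(" ".join(t for t, _ in run))
        run = []
    return phrases


def extract_noun_phrases(utterance):
    """Tokenize, tag and group an utterance into its noun phrase list."""
    return group_noun_phrases(tag(tokenize(utterance)))


print(extract_noun_phrases("What is the price of this compact disk?"))
```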

In the DBProcess, the SQL query is fully customized using the NP extracted from the user's question and other environment variables that are relevant to the application. For example, in a training application, the user's selection of course, chapter and/or section would constitute the environment variables. The SQL query is constructed using the extended SQL Full-Text predicates: CONTAINS, FREETEXT, NEAR, AND. The SQL query is next sent to the Full-Text search engine within the SQL database, where a Full-Text search procedure is initiated. The result of this search procedure is a recordset of answers. This recordset contains stored questions that are similar linguistically to the user's question. Each of these stored questions has a paired answer stored in a separate text file, whose path is stored in a table of the database.
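As a hedged illustration of the query customization described above, the following sketch assembles a parameterized full-text query from a noun phrase list and two environment variables. The table name `questions`, its column names and the function `build_fulltext_query` are hypothetical; only the CONTAINS predicate and the AND operator are taken from the description, and the `?` placeholders assume an ODBC-style database driver.

```python
def build_fulltext_query(noun_phrases, course, chapter):
    """Return (sql, params) for a full-text search over stored questions.

    The table and column names are assumptions made for this example.
    """
    if not noun_phrases:
        raise ValueError("at least one noun phrase is required")
    # Each noun phrase becomes a quoted term; double quotes inside a phrase
    # are dropped so the search condition stays well formed.
    terms = ['"{}"'.format(phrase.replace('"', "")) for phrase in noun_phrases]
    search_condition = " AND ".join(terms)
    sql = (
        "SELECT question_id, question_text FROM questions "
        "WHERE course = ? AND chapter = ? "
        "AND CONTAINS(question_text, ?)"
    )
    # The search condition is passed as a parameter rather than spliced
    # into the SQL text.
    return sql, (course, chapter, search_condition)


sql, params = build_fulltext_query(["price", "compact disk"], "Music", 3)
print(sql)
print(params)
```

Passing the environment variables and the search condition as parameters keeps user-derived text out of the SQL statement itself, which is one reasonable design choice among several.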

The entire recordset of returned stored answers is then returned to the NLE engine in the form of an array. Each stored question of the array is then linguistically processed sequentially one by one. This linguistic processing constitutes the second step of a 2-step algorithm to determine the single best answer to the user's question. This second step proceeds as follows: for each stored question that is returned in the recordset, a NP of the stored question is compared with the NP of the user's question. After all stored questions of the array are compared with the user's question, the stored question that yields the maximum match with the user's question is selected as the best possible stored question that matches the user's question. The metric that is used to determine the best possible stored question is the number of noun phrases.
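The second step described above can be sketched as follows. This is an illustrative reading only: it interprets the metric as the count of noun phrases shared between the user's question and each stored question, and the function names, the (ID, noun phrase list) candidate layout and the tie-breaking rule (first candidate wins) are assumptions made for the example.

```python
def count_matching_noun_phrases(user_nps, stored_nps):
    """Count the noun phrases shared by the user's and a stored question."""
    return len(set(user_nps) & set(stored_nps))


def select_best_question(user_nps, candidates):
    """Pick the stored question with the most matching noun phrases.

    candidates -- list of (question_id, noun_phrases) pairs, standing in for
                  the array of stored questions returned by the first step.
    Returns (question_id, score); (None, 0) when nothing matches.
    """
    best_id, best_score = None, 0
    # Stored questions are processed sequentially, one by one
    for question_id, stored_nps in candidates:
        score = count_matching_noun_phrases(user_nps, stored_nps)
        if score > best_score:  # strict comparison: earlier candidates win ties
            best_id, best_score = question_id, score
    return best_id, best_score


candidates = [
    (1, ["price", "book"]),
    (2, ["price", "compact disk"]),
    (3, ["shipping"]),
]
print(select_best_question(["price", "compact disk"], candidates))
```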

The stored answer that is paired to the best-stored question is selected as the one that answers the user's question. The ID tag of the question is then passed to the DBProcess. This DBProcess returns the answer which is stored in a file.

A communication link is again established to send the answer back to the client in compressed form. The answer once received by the client is decompressed and articulated to the user by the text-to-speech engine. Thus, the invention can be used in any number of different applications involving interactive learning systems, INTERNET related commerce sites, INTERNET search engines, etc.

Computer-assisted instruction environments often require the assistance of mentors or live teachers to answer questions from students. This assistance often takes the form of organizing a separate pre-arranged forum or meeting time that is set aside for chat sessions or live call-in sessions so that at a scheduled time answers to questions may be provided. Because of the time immediacy and the on-demand or asynchronous nature of on-line training where a student may log on and take instruction at any time and at any location, it is important that answers to questions be provided in a timely and cost-effective manner so that the user or student can derive the maximum benefit from the material presented.

This invention addresses the above issues. It provides the user or student with answers to questions that are normally channeled to a live teacher or mentor. This invention provides a single-best answer to questions asked by the student. The student asks the question in his or her own voice in the language of choice. The speech is recognized and the answer to the question is found using a number of technologies including distributed speech recognition, full-text search database processing, natural language processing and text-to-speech technologies. The answer is presented to the user, as in the case of a live teacher, in an articulated manner by an agent that mimics the mentor or teacher, and in the language of choice: English, French, German, Japanese or other natural spoken language. The user can choose the agent's gender as well as several speech parameters such as pitch, volume and speed of the character's voice.

Other applications that benefit from NLQS are e-commerce applications. In this application, the user's query for a price of a book, compact disk or for the availability of any item that is to be purchased can be retrieved without the need to pick through various lists on successive web pages. Instead, the answer is provided directly to the user without any additional user input.

Similarly, it is envisioned that this system can be used to provide answers to frequently-asked questions (FAQs), and as a diagnostic service tool for e-support. These questions are typical of a given web site and are provided to help the user find information related to a payment procedure or the specifications of, or problems experienced with a product/service. In all of these applications, the NLQS architecture can be applied.

A number of inventive methods associated with these architectures are also beneficially used in a variety of INTERNET related applications.

Although the inventions are described below in a set of preferred embodiments, it will be apparent to those skilled in the art that the present inventions could be beneficially used in many environments where it is necessary to implement fast, accurate speech recognition, and/or to provide a human-like dialog capability to an intelligent system.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a preferred embodiment of a naturallanguage query system (NLQS) of the present invention, which isdistributed across a client/server computing architecture, and can beused as an interactive learning system, an e-commerce system, ane-support system, and the like;

FIGS. 2A-2C are a block diagram of a preferred embodiment of a clientside system, including speech capturing modules, partial speechprocessing modules, encoding modules, transmission modules, agentcontrol modules, and answer/voice feedback modules that can be used inthe aforementioned NLQS;

FIG. 2D is a block diagram of a preferred embodiment of a set ofinitialization routines and procedures used for the client side systemof FIG. 2A-2C;

FIG. 3 is a block diagram of a preferred embodiment of a set of routinesand procedures used for handling an iterated set of speech utterances onthe client side system of FIG. 2A-2C, transmitting speech data for suchutterances to a remote server, and receiving appropriate responses backfrom such server;

FIG. 4 is a block diagram of a preferred embodiment of a set ofinitialization routines and procedures used for un-initializing theclient side system of FIGS. 2A-2C;

FIG. 4A is a block diagram of a preferred embodiment of a set ofroutines and procedures used for implementing a distributed component ofa speech recognition module for the server side system of FIG. 5;

FIG. 4B is a block diagram of a preferred set of routines and proceduresused for implementing an SQL query builder for the server side system ofFIG. 5;

FIG. 4C is a block diagram of a preferred embodiment of a set ofroutines and procedures used for implementing a database control processmodule for the server side system of FIG. 5;

FIG. 4D is a block diagram of a preferred embodiment of a set ofroutines and procedures used for implementing a natural language enginethat provides query formulation support, a query response module, and aninterface to the database control process module for the server sidesystem of FIG. 5;

FIG. 5 is a block diagram of a preferred embodiment of a server sidesystem, including a speech recognition module to complete processing ofthe speech utterances, environmental and grammar control modules, queryformulation modules, a natural language engine, a database controlmodule, and a query response module that can be used in theaforementioned NLQS;

FIG. 6 illustrates the organization of a full-text database used as partof server side system shown in FIG. 5;

FIG. 7A illustrates the organization of a full-text database coursetable used as part of server side system shown in FIG. 5 for aninteractive learning embodiment of the present invention;

FIG. 7B illustrates the organization of a full-text database chaptertable used as part of server side system shown in FIG. 5 for aninteractive learning embodiment of the present invention;

FIG. 7C describes the fields used in a chapter table used as part ofserver side system shown in FIG. 5 for an interactive learningembodiment of the present invention;

FIG. 7D describes the fields used in a section table used as part ofserver side system shown in FIG. 5 for an interactive learningembodiment of the present invention;

FIG. 8 is a flow diagram of a first set of operations performed by a preferred embodiment of a natural language engine on a speech utterance including Tokenization, Tagging and Grouping;

FIG. 9 is a flow diagram of the operations performed by a preferred embodiment of a natural language engine on a speech utterance including stemming and Lexical Analysis;

FIG. 10 is a block diagram of a preferred embodiment of a SQL database search and support system for the present invention;

FIGS. 11A-11C are flow diagrams illustrating steps performed in a preferred two step process implemented for query recognition by the NLQS of FIG. 2;

FIG. 12 is an illustration of another embodiment of the present invention implemented as part of a Web-based, speech-based learning/training system;

FIGS. 13-17 are illustrations of another embodiment of the present invention implemented as part of a Web-based e-commerce system;

FIG. 18 is an illustration of another embodiment of the present invention implemented as part of a voice-based Help Page for an E-Commerce Web Site.

DETAILED DESCRIPTION OF THE INVENTION

Overview

As alluded to above, the present inventions allow a user to ask a question in a natural language such as English, French, German, Spanish or Japanese at a client computing system (which can be as simple as a personal digital assistant or cell-phone, or as sophisticated as a high end desktop PC) and receive an appropriate answer from a remote server also in his or her native natural language. As such, the embodiment of the invention shown in FIG. 1 is beneficially used in what can be generally described as a Natural Language Query System (NLQS) 100, which is configured to interact on a real-time basis to give a human-like dialog capability/experience for e-commerce, e-support, and e-learning applications.

The processing for NLQS 100 is generally distributed across a client-side system 150, a data link 160, and a server-side system 180. These components are well known in the art, and in a preferred embodiment include a personal computer system 150, an INTERNET connection 160A, 160B, and a larger scale computing system 180. It will be understood by those skilled in the art that these are merely exemplary components, and that the present invention is by no means limited to any particular implementation or combination of such systems. For example, client-side system 150 could also be implemented as a computer peripheral, a PDA, as part of a cell-phone, as part of an INTERNET-adapted appliance, an INTERNET linked kiosk, etc. Similarly, while an INTERNET connection is depicted for data link 160A, it is apparent that any channel that is suitable for carrying data between client system 150 and server system 180 will suffice, including a wireless link, an RF link, an IR link, a LAN, and the like. Finally, it will be further appreciated that server system 180 may be a single, large-scale system, or a collection of smaller systems interlinked to support a number of potential network users.

Initially, speech input is provided in the form of a question or query articulated by the speaker at the client's machine or personal accessory as a speech utterance. This speech utterance is captured and partially processed by NLQS client-side software 155 resident in the client's machine. To facilitate and enhance the human-like aspects of the interaction, the question is presented in the presence of an animated character 157 visible to the user, who assists the user as a personal information retriever/agent. The agent can also interact with the user using visible text output on a monitor/display (not shown) and/or audible output from a text to speech engine 159. The output of the partial processing done by SRE 155 is a set of speech vectors that are transmitted over communication channel 160, which links the user's machine or personal accessory to a server or servers via the INTERNET or a wireless gateway that is linked to the INTERNET as explained above. At server 180, the partially processed speech signal data is handled by a server-side SRE 182, which then outputs recognized speech text corresponding to the user's question. Based on this user question related text, a text-to-query converter 184 formulates a suitable query that is used as input to a database processor 186. Based on the query, database processor 186 then locates and retrieves an appropriate answer using a customized SQL query from database 188. A Natural Language Engine 190 facilitates structuring the query to database 188. After a matching answer to the user's question is found, the former is transmitted in text form across data link 160B, where it is converted into speech by text to speech engine 159, and thus expressed as oral feedback by animated character agent 157.

Because the speech processing is broken up in this fashion, it is possible to achieve real-time, interactive, human-like dialog consisting of a large, controllable set of questions/answers. The assistance of the animated agent 157 further enhances the experience, making it more natural and comfortable for even novice users. To make the speech recognition process more reliable, context-specific grammars and dictionaries are used, as well as natural language processing routines at NLE 190, to analyze user questions lexically. While context-specific processing of speech data is known in the art (see e.g., U.S. Pat. Nos. 5,960,394, 5,867,817, 5,758,322 and 5,384,892 incorporated by reference herein) the present inventors are unaware of any such implementation as embodied in the present inventions. The text of the user's question is compared against text of other questions by DB processor/engine (DBE) 186 to identify the question posed by the user. By optimizing the interaction and relationship of the SR engines 155 and 182, the NLP routines 190, and the dictionaries and grammars, an extremely fast and accurate match can be made, so that a unique and responsive answer can be provided to the user.

On the server side 180, interleaved processing further accelerates the speech recognition process. In simplified terms, after the query is formulated it is presented simultaneously both to NLE 190 and to DBE 186. NLE 190 and SRE 182 perform complementary functions in the overall recognition process. In general, SRE 182 is primarily responsible for determining the identity of the words articulated by the user, while NLE 190 is responsible for the linguistic morphological analysis of both the user's query and the search results returned after the database query.

After the user's query is analyzed by NLE 190, some parameters are extracted and sent to the DBProcess. Additional statistics are stored in an array for the second step of processing. During the second step of the two-step algorithm, the recordset of preliminary search results is sent to NLE 190 for processing. At the end of this second step, the single stored question that best matches the user's query is sent to the DBProcess, where further processing yields the answer paired with that question.

Thus, the present invention uses a form of natural language processing (NLP) to achieve optimal performance in a speech based web application system. While NLP is known in the art, prior NLP efforts nonetheless have not been well integrated with Speech Recognition (SR) technologies to achieve reasonable results in a web-based application environment. In speech recognition, the result is typically a lattice of possible recognized words, each with some probability of fit with the speech recognizer. As described before, the input to a typical NLP system is typically a large linguistic unit. The NLP system is then charged with the parsing, understanding and indexing of this large linguistic unit or set of transcribed utterances. The result of this NLP process is to understand lexically or morphologically the entire linguistic unit, as opposed to word recognition. Put another way, the linguistic unit or sentence of connected words output by the SRE has to be understood lexically, as opposed to just being “recognized”.

As indicated earlier, although speech recognition technology has been available for several years, the technical requirements for the NLQS invention have placed severe restrictions on the specifications for the speech recognition accuracy that is required for an application that combines speech recognition and natural language processing to work satisfactorily. Recognizing that, even under the best of conditions, it might not be possible to achieve the perfect 100% speech recognition accuracy that is required, the present invention employs an algorithm that balances the potential risk of the speech recognition process with the requirements of the natural language processing, so that even in cases where perfect speech recognition accuracy is not achieved for each word in the query, the entire query itself is nonetheless recognized with sufficient accuracy.

This recognition accuracy is achieved even while meeting very stringent user constraints, such as short latency periods of 3 to 5 seconds (ideally, ignoring transmission latencies, which can vary) for responding to a speech-based query, and for a potential set of 100-250 query questions. This quick response time gives the overall appearance and experience of a real-time discourse that is more natural and pleasant from the user's perspective. Of course, non-real time applications, such as translation services for example, can also benefit from the present teachings, since a centralized set of HMMs, grammars, dictionaries, etc., is maintained.

General Aspects of Speech Recognition Used in the Present Inventions

General background information on speech recognition can be found in the prior art references discussed above and incorporated by reference herein. Nonetheless, a discussion of some particular exemplary forms of speech recognition structures and techniques that are well-suited for NLQS 100 is provided next to better illustrate some of the characteristics, qualities and features of the present inventions.

Speech recognition technology is typically of two types: speaker independent and speaker dependent. In speaker-dependent speech recognition technology, each user has a voice file in which a sample of each potentially recognized word is stored. Speaker-dependent speech recognition systems typically have large vocabularies and dictionaries, making them suitable for applications such as dictation and text transcribing. It follows also that the memory and processor resource requirements for speaker-dependent systems can be, and typically are, large and intensive.

Conversely, speaker-independent speech recognition technology allows a large group of users to use a single vocabulary file. It follows then that the degree of accuracy that can be achieved is a function of the size and complexity of the grammars and dictionaries that can be supported for a given language. Given the context of applications for which NLQS is intended, the use of small grammars and dictionaries allows speaker independent speech recognition technology to be implemented in NLQS.

The key issues or requirements for either type, speaker-independent or speaker-dependent, are accuracy and speed. As the size of the user dictionaries increases, both recognition accuracy and recognition speed degrade: the word error rate (WER) rises and recognition slows. This is so because the search time increases and the pronunciation match becomes more complex as the size of the dictionary increases.

The basis of the NLQS speech recognition system is a series of Hidden Markov Models (HMMs), which, as alluded to earlier, are mathematical models used to characterize any time varying signal. Because parts of speech are considered to be based on an underlying sequence of one or more symbols, the HMMs corresponding to each symbol are trained on vectors from the speech waveforms. The Hidden Markov Model is a finite set of states, each of which is associated with a (generally multi-dimensional) probability distribution. Transitions among the states are governed by a set of probabilities called transition probabilities. In a particular state an outcome or observation can be generated, according to an associated probability distribution. This finite state machine changes state once every time unit, and each time t that a state j is entered, a spectral parameter vector $o_t$ is generated with probability density $b_j(o_t)$. It is only the outcome, not the state, which is visible to an external observer; the states are therefore “hidden” to the outside, hence the name Hidden Markov Model.

In isolated speech recognition, it is assumed that the sequence of observed speech vectors corresponding to each word can each be described by a Markov model as follows:

$$O = o_1, o_2, \ldots, o_T \qquad (1\text{-}1)$$

where $o_t$ is a speech vector observed at time t. The isolated word recognition problem is then to compute:

$$\arg\max_i \{P(w_i \mid O)\} \qquad (1\text{-}2)$$

By using Bayes' Rule,

$$P(w_i \mid O) = \frac{P(O \mid w_i)\, P(w_i)}{P(O)} \qquad (1\text{-}3)$$

In the general case, the Markov model when applied to speech also assumes a finite state machine which changes state once every time unit, and each time that a state j is entered, a speech vector $o_t$ is generated from the probability density $b_j(o_t)$. Furthermore, the transition from state i to state j is also probabilistic and is governed by the discrete probability $a_{ij}$.

For a state sequence X, the joint probability that O is generated by the model M moving through the state sequence X is the product of the transition probabilities and the output probabilities. Only the observation sequence is known; the state sequence is hidden, as mentioned before.

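Given that X is unknown, the required likelihood is computed by summing over all possible state sequences X = x(1), x(2), x(3), . . . x(T), that is

$$P(O \mid M) = \sum_{X} a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(o_t)\, a_{x(t)x(t+1)}$$

with x(0) and x(T+1) denoting the entry and exit states of the model.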

Given a set of models $M_i$ corresponding to words $w_i$, equation 1-2 is solved by using 1-3 and also by assuming that:

$$P(O \mid w_i) = P(O \mid M_i)$$
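The isolated-word decision rule of equations 1-2 and 1-3 can be illustrated with a minimal sketch. It evaluates P(O|M) by brute-force summation over every state sequence, which is practical only for toy models. The two word models, their shared transition matrix `A`, the univariate Gaussian output densities and the priors are all hypothetical values chosen for illustration and do not come from the NLQS implementation.

```python
import itertools
import math

def gaussian(x, mean, var):
    """Univariate Gaussian density, standing in for b_j(o_t) (an assumption)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def likelihood(obs, a, means, variances):
    """P(O|M): sum over every state sequence X of the product of transition
    and output probabilities. State 0 is the entry state, the last is the exit."""
    n = len(a)
    total = 0.0
    for path in itertools.product(range(1, n - 1), repeat=len(obs)):
        x = (0,) + path + (n - 1,)            # x(0) .. x(T+1)
        p = a[x[0]][x[1]]
        for t, o in enumerate(obs, start=1):
            p *= gaussian(o, means[x[t]], variances[x[t]]) * a[x[t]][x[t + 1]]
        total += p
    return total

def recognize(obs, models, priors):
    """arg max over words of P(O|M_i) P(w_i); P(O) is the same for every word."""
    scores = {w: likelihood(obs, *m) * priors[w] for w, m in models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical left-to-right topology shared by two toy word models.
A = [[0.0, 1.0, 0.0, 0.0],
     [0.0, 0.6, 0.4, 0.0],
     [0.0, 0.0, 0.7, 0.3],
     [0.0, 0.0, 0.0, 0.0]]
MODELS = {
    "yes": (A, [None, 0.0, 1.0, None], [None, 1.0, 1.0, None]),
    "no":  (A, [None, 5.0, 6.0, None], [None, 1.0, 1.0, None]),
}
PRIORS = {"yes": 0.5, "no": 0.5}

best_word, word_scores = recognize([0.1, 0.2, 0.9, 1.1], MODELS, PRIORS)
```

Because the number of state sequences grows exponentially with T, a practical recognizer would replace the enumeration with a recursive computation of the same quantity.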

All of this assumes that the parameters $\{a_{ij}\}$ and $\{b_j(o_t)\}$ are known for each model $M_i$. This can be done, as explained earlier, by using a set of training examples corresponding to a particular model. Thereafter, the parameters of that model can be determined automatically by a robust and efficient re-estimation procedure. So if a sufficient number of representative examples of each word are collected, then an HMM can be constructed which simply models all of the many sources of variability inherent in real speech. This training is well-known in the art, so it is not described at length herein, except to note that the distributed architecture of the present invention enhances the quality of HMMs, since they are derived and constituted at the server side, rather than the client side. In this way, appropriate samples from users of different geographical areas can be easily compiled and analyzed to optimize the possible variations expected to be seen across a particular language to be recognized. Uniformity of the speech recognition process is also well-maintained, and error diagnostics are simplified, since each prospective user is using the same set of HMMs during the recognition process.

To determine the parameters of an HMM from a set of training samples, the first step typically is to make a rough guess as to what they might be. Then a refinement is done using the Baum-Welch estimation formulae. By these formulae, the maximum likelihood estimate of $\mu_j$ (where $\mu_j$ is the mean vector and $\Sigma_j$ is the covariance matrix of state j) is:

$$\mu_j = \frac{\sum_{t=1}^{T} L_j(t)\, o_t}{\sum_{t=1}^{T} L_j(t)}$$

A forward-backward algorithm is next used to calculate the probability of state occupation $L_j(t)$. The forward probability $\alpha_j(t)$ for some model M with N states is defined as:

$$\alpha_j(t) = P(o_1, \ldots, o_t,\ x(t) = j \mid M)$$

This probability can be calculated using the recursion:

$$\alpha_j(t) = \left[\sum_{i=2}^{N-1} \alpha_i(t-1)\, a_{ij}\right] b_j(o_t)$$

Similarly, the backward probability can be computed using the recursion:

$$\beta_i(t) = \sum_{j=2}^{N-1} a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)$$

Realizing that the forward probability is a joint probability and the backward probability is a conditional probability, the probability of state occupation is determined by the product of the two probabilities:

$$\alpha_j(t)\, \beta_j(t) = P(O,\ x(t) = j \mid M)$$

Hence the probability of being in state j at a time t is:

$$L_j(t) = \frac{1}{P}\, \alpha_j(t)\, \beta_j(t)$$

where $P = P(O \mid M)$.
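The forward-backward computation of the state occupation probabilities, and their use in re-estimating the state means, can be sketched as follows. This is a minimal illustration that assumes scalar observations, univariate Gaussian output densities, a non-emitting entry state 0 and a non-emitting exit state N-1; the transition matrix and initial parameter guesses are hypothetical.

```python
import math

def gaussian(x, mean, var):
    """Univariate Gaussian density standing in for b_j(o_t) (an assumption)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def forward_backward(obs, a, means, variances):
    """Return P(O|M) and the occupation probabilities L_j(t) as occ[t][j]."""
    n, T = len(a), len(obs)
    emit = range(1, n - 1)                       # emitting states only
    b = [[gaussian(o, means[j], variances[j]) if 0 < j < n - 1 else 0.0
          for j in range(n)] for o in obs]
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[0.0] * n for _ in range(T)]
    for j in emit:
        alpha[0][j] = a[0][j] * b[0][j]          # leave the entry state
        beta[T - 1][j] = a[j][n - 1]             # reach the exit state
    for t in range(1, T):                        # forward recursion
        for j in emit:
            alpha[t][j] = sum(alpha[t - 1][i] * a[i][j] for i in emit) * b[t][j]
    for t in range(T - 2, -1, -1):               # backward recursion
        for i in emit:
            beta[t][i] = sum(a[i][j] * b[t + 1][j] * beta[t + 1][j] for j in emit)
    p = sum(alpha[T - 1][j] * a[j][n - 1] for j in emit)
    occ = [[alpha[t][j] * beta[t][j] / p for j in range(n)] for t in range(T)]
    return p, occ

def reestimate_means(obs, occ, n):
    """Occupation-weighted average of the observations for each emitting state."""
    T = len(obs)
    return {j: sum(occ[t][j] * obs[t] for t in range(T)) /
               sum(occ[t][j] for t in range(T))
            for j in range(1, n - 1)}

# Hypothetical model and training utterance.
A = [[0.0, 1.0, 0.0, 0.0],
     [0.0, 0.6, 0.4, 0.0],
     [0.0, 0.0, 0.7, 0.3],
     [0.0, 0.0, 0.0, 0.0]]
OBS = [0.1, 0.2, 0.9, 1.1]
prob, occupation = forward_backward(OBS, A, [None, 0.0, 1.0, None], [None, 1.0, 1.0, None])
new_means = reestimate_means(OBS, occupation, len(A))
```

In a full training procedure this update would be iterated, and accumulated over many utterances, until the likelihood stops improving; the sketch performs a single pass over one utterance.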

To generalize the above for continuous speech recognition, we assume the maximum likelihood state sequence, where the summation is replaced by a maximum operation. Thus for a given model M, let $\phi_j(t)$ represent the maximum likelihood of observing speech vectors $o_1$ to $o_t$ and being in state j at time t:

$$\phi_j(t) = \max_i \{\phi_i(t-1)\, a_{ij}\}\, b_j(o_t)$$

Expressing this logarithmically to avoid underflow, this likelihood becomes:

$$\psi_j(t) = \max_i \{\psi_i(t-1) + \log(a_{ij})\} + \log(b_j(o_t))$$

This is also known as the Viterbi algorithm. It can be visualized as finding the best path through a matrix where the vertical dimension represents the states of the HMM and the horizontal dimension represents frames of speech, i.e. time. To complete the extension to connected speech recognition, it is further assumed that each HMM representing the underlying sequence is connected. Thus the training data for continuous speech recognition should consist of connected utterances; however, the boundaries between words do not have to be known.
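The log-domain recursion can be sketched as a small dynamic program over the state/frame matrix described above. The model below is a hypothetical three-state example with scalar observations and Gaussian log densities; state 0 is assumed to be a non-emitting entry state and the last state a non-emitting exit state.

```python
import math

NEG_INF = float("-inf")

def safe_log(x):
    """log(x), with log(0) mapped to -infinity so forbidden arcs never win."""
    return math.log(x) if x > 0.0 else NEG_INF

def viterbi(obs, a, log_b):
    """Best state path and its log likelihood; log_b(j, o) returns log b_j(o)."""
    n, T = len(a), len(obs)
    emit = range(1, n - 1)
    psi = [[NEG_INF] * n for _ in range(T)]
    back = [[0] * n for _ in range(T)]
    for j in emit:
        psi[0][j] = safe_log(a[0][j]) + log_b(j, obs[0])
    for t in range(1, T):
        for j in emit:
            # max over predecessors i of psi_i(t-1) + log a_ij
            best_i = max(emit, key=lambda i: psi[t - 1][i] + safe_log(a[i][j]))
            psi[t][j] = psi[t - 1][best_i] + safe_log(a[best_i][j]) + log_b(j, obs[t])
            back[t][j] = best_i
    last = max(emit, key=lambda j: psi[T - 1][j] + safe_log(a[j][n - 1]))
    score = psi[T - 1][last] + safe_log(a[last][n - 1])
    path = [last]
    for t in range(T - 1, 0, -1):                # trace the best path backwards
        path.append(back[t][path[-1]])
    path.reverse()
    return score, path

# Hypothetical model: unit-variance Gaussians with means 0.0 and 1.0.
A = [[0.0, 1.0, 0.0, 0.0],
     [0.0, 0.6, 0.4, 0.0],
     [0.0, 0.0, 0.7, 0.3],
     [0.0, 0.0, 0.0, 0.0]]
MEANS = {1: 0.0, 2: 1.0}

def log_gauss(j, o):
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (o - MEANS[j]) ** 2

best_score, best_path = viterbi([0.1, 0.2, 0.9, 1.1], A, log_gauss)
```

For this toy input the decoder is expected to stay in state 1 for the first two frames and move to state 2 for the last two, since the observations shift from near 0 to near 1.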

To improve computational speed/efficiency, the Viterbi algorithm is sometimes extended to achieve convergence by using what is known as a Token Passing Model. The token passing model represents a partial match between the observation sequence $o_1$ to $o_t$ and a particular model, subject to the constraint that the model is in state j at time t. This token passing model can be extended easily to connected speech environments as well if we allow the sequence of HMMs to be defined as a finite state network. A composite network that includes both phoneme-based HMMs and complete words can be constructed so that a single-best word can be recognized to form connected speech using word N-best extraction from the lattice of possibilities. This composite form of HMM-based connected speech recognizer is the basis of the NLQS speech recognizer module. Nonetheless, the present invention is not limited to such specific forms of speech recognizers, and can employ other techniques for speech recognition if they are otherwise compatible with the present architecture and meet necessary performance criteria for accuracy and speed to provide a real-time dialog experience for users.

The representation of speech for the present invention's HMM-based speech recognition system assumes that speech is essentially either a quasi-periodic pulse train (for voiced speech sounds) or a random noise source (for unvoiced sounds). It may be modeled as two sources, one an impulse train generator with pitch period P and the other a random noise generator, selected by a voiced/unvoiced switch. The output of the switch is then fed into a gain function estimated from the speech signal and scaled to feed a digital filter H(z) controlled by the vocal tract parameter characteristics of the speech being produced. All of the parameters for this model (the voiced/unvoiced switching, the pitch period for voiced sounds, the gain parameter for the speech signal and the coefficients of the digital filter) vary slowly with time. In extracting the acoustic parameters from the user's speech input so that it can be evaluated in light of a set of HMMs, cepstral analysis is typically used to separate the vocal tract information from the excitation information. The cepstrum of a signal is computed by taking the Fourier (or similar) transform of the log spectrum. The principal advantage of extracting cepstral coefficients is that they are de-correlated, so that diagonal covariances can be used with HMMs. Since the human ear resolves frequencies non-linearly across the audio spectrum, it has been shown that a front-end that operates in a similar non-linear way improves speech recognition performance.

Accordingly, instead of a typical linear prediction-based analysis, the front-end of the NLQS speech recognition engine implements a simple, fast Fourier transform based filter bank designed to give approximately equal resolution on the Mel-scale. To implement this filter bank, a window of speech data (for a particular time frame) is transformed using a software based Fourier transform and the magnitude taken. Each FFT magnitude is then multiplied by the corresponding filter gain and the results accumulated. The cepstral coefficients that are derived from this filter-bank analysis at the front end are calculated during a first partial processing phase of the speech signal by using a Discrete Cosine Transform of the log filter bank amplitudes. These cepstral coefficients are called Mel-Frequency Cepstral Coefficients (MFCC) and they represent some of the speech parameters transferred from the client side to characterize the acoustic features of the user's speech signal. These parameters are chosen for a number of reasons, including the fact that they can be quickly and consistently derived even across systems of disparate capabilities (i.e., for everything from a low power PDA to a high powered desktop system), they give good discrimination, they lend themselves to a number of useful recognition related manipulations, and they are relatively small and compact in size so that they can be transported rapidly across even a relatively narrow band link. Thus, these parameters represent the least amount of information that can be used by a subsequent server side system to adequately and quickly complete the recognition process.
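The front-end steps just described (window, Fourier magnitude, Mel-spaced filter bank, logarithm, Discrete Cosine Transform) can be sketched for a single frame as follows. This is an illustrative sketch only: the Hamming window, the 8 kHz sample rate, the 20 triangular filters, the common 2595·log10(1 + f/700) Mel mapping and the naive DFT (used in place of an FFT to keep the example dependency-free) are all assumptions not specified in the text.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (a real front end would use an FFT)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose centres are equally spaced on the Mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bins = [f * n_fft / sample_rate for f in edges_hz]   # fractional bin positions
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))        # rising slope
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))        # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(frame, sample_rate=8000, num_filters=20, num_ceps=12):
    """Return num_ceps cepstral coefficients for one frame of samples."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    mags = magnitude_spectrum(windowed)
    bank = mel_filterbank(num_filters, n, sample_rate)
    # Accumulate gain-weighted magnitudes per filter, then take the log.
    log_amp = [math.log(max(sum(g * m for g, m in zip(row, mags)), 1e-10))
               for row in bank]
    # Discrete Cosine Transform of the log filter bank amplitudes.
    return [math.sqrt(2.0 / num_filters) *
            sum(log_amp[j] * math.cos(math.pi * i * (j + 0.5) / num_filters)
                for j in range(num_filters))
            for i in range(1, num_ceps + 1)]

# Hypothetical test frame: 256 samples of a 440 Hz tone at 8 kHz.
FRAME = [math.sin(2.0 * math.pi * 440.0 * i / 8000.0) for i in range(256)]
coefficients = mfcc(FRAME)
```

Each frame is reduced to twelve numbers regardless of frame length, which is what makes the representation compact enough to send over a narrow band link.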

To augment the speech parameters, an energy term in the form of the logarithm of the signal energy is added. Accordingly, RMS energy is added to the 12 MFCCs to make 13 coefficients. These coefficients together make up the partially processed speech data transmitted in compressed form from the user's client system to the remote server side.

The performance of the present speech recognition system is enhanced significantly by computing and adding time derivatives to the basic static MFCC parameters at the server side. These two other sets of coefficients, the delta and acceleration coefficients representing change in each of the 13 values from frame to frame (actually measured across several frames), are computed during a second partial speech signal processing phase to complete the initial processing of the speech signal, and are added to the original set of coefficients after the latter are received. These MFCCs together with the delta and acceleration coefficients constitute the observation vector $o_t$ mentioned above that is used for determining the appropriate HMM for the speech data.

The delta and acceleration coefficients are computed using the following regression formula:

$$d_t = \frac{\sum_{\theta=1}^{\Theta} \theta\, (c_{t+\theta} - c_{t-\theta})}{2 \sum_{\theta=1}^{\Theta} \theta^2}$$

where $d_t$ is a delta coefficient at time t computed in terms of the corresponding static coefficients. For a single offset $\theta$ this reduces to the simple difference:

$$d_t = \frac{c_{t+\theta} - c_{t-\theta}}{2\theta}$$
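A minimal sketch of the regression formula follows, applied to a sequence of static coefficient vectors to build the full observation vectors. The window size of two frames and the edge handling (repeating the first and last frame) are assumptions made for illustration; the acceleration coefficients are obtained here by applying the same regression to the deltas.

```python
def deltas(frames, window=2):
    """Regression-based time derivatives of a list of coefficient vectors.

    Frames beyond either end are replaced by the first or last frame
    (an assumed edge policy).
    """
    T = len(frames)
    denom = 2.0 * sum(th * th for th in range(1, window + 1))
    out = []
    for t in range(T):
        row = []
        for k in range(len(frames[t])):
            num = sum(th * (frames[min(t + th, T - 1)][k] - frames[max(t - th, 0)][k])
                      for th in range(1, window + 1))
            row.append(num / denom)
        out.append(row)
    return out

def observation_vectors(static, window=2):
    """Concatenate static, delta and acceleration coefficients per frame."""
    d = deltas(static, window)
    a = deltas(d, window)            # acceleration = delta of the deltas
    return [s + dd + aa for s, dd, aa in zip(static, d, a)]

# Hypothetical input: six frames of 13 static coefficients rising by 1 per frame.
STATIC = [[float(t)] * 13 for t in range(6)]
OBSERVATIONS = observation_vectors(STATIC)
```

With 13 static coefficients per frame, each resulting observation vector has 39 components, and for a linear ramp the interior delta values equal the slope of the ramp.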

In a typical stand-alone implementation of a speech recognition system, the entire SR engine runs on a single client. In other words, both the first and second partial processing phases above are executed by the same DSP (or microprocessor) running a ROM or software code routine at the client's computing machine.

In contrast, because of several considerations, specifically cost, technical performance, and client hardware uniformity, the present NLQS system uses a partitioned or distributed approach. While some processing occurs on the client side, the main speech recognition engine runs on a centrally located server or number of servers. More specifically, as noted earlier, capture of the speech signals, MFCC vector extraction and compression are implemented on the client's machine during a first partial processing phase. The routine is thus streamlined and simple enough to be implemented within a browser program (as a plug in module, or a downloadable applet for example) for maximum ease of use and utility. Accordingly, even very “thin” client platforms can be supported, which enables the use of the present system across a greater number of potential sites. The primary MFCCs are then transmitted to the server over the channel, which, for example, can include a dial-up INTERNET connection, a LAN connection, a wireless connection and the like. After decompression, the delta and acceleration coefficients are computed at the server to complete the initial speech processing phase, and the resulting observation vectors $o_t$ are also determined.

General Aspects of Speech Recognition Engine

The speech recognition engine is also located on the server, and is based on an HTK-based recognition network compiled from a word-level network, a dictionary and a set of HMMs. The recognition network consists of a set of nodes connected by arcs. Each node is either an HMM model instance or a word end. Each model node is itself a network consisting of states connected by arcs. Thus when fully compiled, a speech recognition network consists of HMM states connected by transitions. For an unknown input utterance with T frames, every path from the start node to the exit node of the network passes through T HMM states. Each of these paths has a log probability which is computed by summing the log probability of each individual transition in the path and the log probability of each emitting state generating the corresponding observation. The function of the Viterbi decoder is to find those paths through the network which have the highest log probability. This is done using the Token Passing algorithm. In a network that has many nodes, the computation time is reduced by only allowing propagation of those tokens which have some chance of becoming winners. This process is called pruning.
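Token passing with pruning can be sketched in a few lines. In this simplified illustration each state holds at most one token (its best partial path and log probability), tokens are propagated along arcs once per frame, and any token whose score falls more than a fixed beam width below the best token of that frame is discarded. The network, the unnormalised Gaussian log scores and the beam values are hypothetical, and word-end nodes are omitted for brevity.

```python
import math

def token_passing(arcs, log_emit, frames, beam):
    """Return (log score, state history) of the best surviving token.

    arcs[i] is a list of (j, log a_ij) pairs; log_emit(j, o) is log b_j(o).
    State 0 is a non-emitting start state holding the initial token.
    """
    tokens = {0: (0.0, [])}
    for o in frames:
        new_tokens = {}
        for i, (score, history) in tokens.items():
            for j, log_a in arcs.get(i, []):
                cand = score + log_a + log_emit(j, o)
                # Keep only the best token arriving in each state.
                if j not in new_tokens or cand > new_tokens[j][0]:
                    new_tokens[j] = (cand, history + [j])
        best = max(score for score, _ in new_tokens.values())
        # Pruning: drop tokens that fall outside the beam.
        tokens = {j: tok for j, tok in new_tokens.items() if tok[0] >= best - beam}
    return max(tokens.values(), key=lambda tok: tok[0])

# Hypothetical two-state network and scoring function.
MEANS = {1: 0.0, 2: 1.0}
ARCS = {0: [(1, math.log(0.5)), (2, math.log(0.5))],
        1: [(1, math.log(0.6)), (2, math.log(0.4))],
        2: [(2, 0.0)]}

def log_emit(j, o):
    return -0.5 * (o - MEANS[j]) ** 2    # unnormalised log Gaussian

FRAMES = [0.1, 0.2, 0.9, 1.1]
exact = token_passing(ARCS, log_emit, FRAMES, beam=float("inf"))
pruned = token_passing(ARCS, log_emit, FRAMES, beam=0.5)
```

An infinite beam reproduces the exhaustive Viterbi search; a narrow beam evaluates fewer tokens per frame at the risk of discarding the path that would eventually have won, so the pruned score can never exceed the exact one.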

Natural Language Processor

In a typical natural language interface to a database, the user enters a question in his/her natural language, for example, English. The system parses it and translates it to a query language expression. The system then uses the query language expression to process the query and, if the search is successful, a recordset representing the results is displayed in English, either formatted as raw text or in a graphical form. For a natural language interface to work well, a number of technical requirements must be met.

For example, it needs to be robust: in the sentence ‘What's the departments turnover’ it needs to decide that the word whats = what's = what is. And it also has to determine that departments = department's. In addition to being robust, the natural language interface has to distinguish between the several possible forms of ambiguity that may exist in the natural language: lexical, structural, reference and ellipsis ambiguity. All of these requirements, in addition to the general ability to perform basic linguistic morphological operations of tokenization, tagging and grouping, are implemented within the present invention.

Tokenization is implemented by a text analyzer which treats the text as a series of tokens or useful meaningful units that are larger than individual characters, but smaller than phrases and sentences. These include words, separable parts of words, and punctuation. Each token is associated with an offset and a length. The first phase of tokenization is the process of segmentation, which extracts the individual tokens from the input text and keeps track of the offset where each token originated in the input text. The tokenizer output lists the offset and category for each token. In the next phase of the text analysis, the tagger uses a built-in morphological analyzer to look up each word/token in a phrase or sentence and internally lists all parts of speech. The output is the input string with each token tagged with a part of speech notation. Finally the grouper, which functions as a phrase extractor or phrase analyzer, determines which groups of words form phrases. These three operations, which are the foundations for any modern linguistic processing scheme, are fully implemented in optimized algorithms for determining the single-best possible answer to the user's question.
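The segmentation and tagging phases can be illustrated with a small sketch. The regular expression, the category names and the tiny lexicon below are hypothetical stand-ins for the text analyzer and morphological analyzer, which the text does not specify in detail; the grouping phase is omitted.

```python
import re

# Hypothetical token categories: words, separable parts of words
# (clitics such as 's), numbers and punctuation.
TOKEN_RE = re.compile(
    r"(?P<word>[A-Za-z]+)|(?P<clitic>'[A-Za-z]+)|(?P<number>\d+)|(?P<punct>[^\sA-Za-z\d])"
)

def tokenize(text):
    """Return (token, offset, length, category) for each token in the text."""
    return [(m.group(), m.start(), len(m.group()), m.lastgroup)
            for m in TOKEN_RE.finditer(text)]

# Toy lexicon standing in for a morphological analyzer (an assumption).
LEXICON = {
    "what": ["PRON", "DET"],
    "'s": ["VERB", "POSS"],
    "the": ["DET"],
    "departments": ["NOUN"],
    "turnover": ["NOUN"],
}

def tag(tokens):
    """List every candidate part of speech for each token."""
    return [(text, LEXICON.get(text.lower(), ["UNK"])) for text, _, _, _ in tokens]

TOKENS = tokenize("What's the departments turnover?")
TAGGED = tag(TOKENS)
```

Keeping the offset and length with every token allows later stages to map a tagged or grouped unit back to its position in the original query text.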

SQL Database and Full-Text Query

Another key component of the present system is a SQL database. This database is used to store text; specifically, the answer-question pairs are stored in full-text tables of the database. Additionally, the full-text search capability of the database allows full-text searches to be carried out.

While a large portion of all digitally stored information is in the form of unstructured data, primarily text, it is now possible to store this textual data in traditional database systems in character-based columns such as varchar and text. In order to effectively retrieve textual data from the database, techniques have to be implemented to issue queries against textual data and to retrieve the answers in a meaningful way, as is done in the NLQS system.

There are two major types of textual searches: (1) Property: this search technology first applies filters to documents in order to extract properties such as author, subject, type, word count, printed page count, and time last written, and then issues searches against those properties; (2) Full-text: this search technology first creates indexes of all non-noise words in the documents, and then uses these indexes to support linguistic searches and proximity searches.

Two additional technologies have also been integrated into this particular RDBMS (SQL Server): a Search service, which is a full-text indexing and search service that acts as both index engine and search engine; and a parser that accepts full-text SQL extensions and maps them into a form that can be processed by the search engine.

The four major aspects involved in implementing full-text retrieval of plain-text data from a full-text-capable database are: (1) managing the definition of the tables and columns that are registered for full-text searches; (2) indexing the data in registered columns, in which the indexing process scans the character streams, determines the word boundaries (this is called word breaking), removes all noise words (also called stop words), and then populates a full-text index with the remaining words; (3) issuing queries against registered columns for populated full-text indexes; and (4) ensuring that subsequent changes to the data in registered columns get propagated to the index engine to keep the full-text indexes synchronized.

The underlying design principle for the indexing, querying, and synchronizing processes is the presence of a full-text unique key column (or single-column primary key) on all tables registered for full-text searches. The full-text index contains an entry for the non-noise words in each row together with the value of the key column for each row.

When processing a full-text search, the search engine returns to the database the key values of the rows that match the search criteria.

The full-text administration process starts by designating a table and its columns of interest for full-text search. Customized NLQS stored procedures are used first to register tables and columns as eligible for full-text search. After that, a separate request by means of a stored procedure is issued to populate the full-text indexes. The result is that the underlying index engine gets invoked and asynchronous index population begins. Full-text indexing tracks which significant words are used and where they are located. For example, a full-text index might indicate that the word “NLQS” is found at word number 423 and word number 982 in the Abstract column of the DevTools table for the row associated with a ProductID of 6. This index structure supports an efficient search for all items containing indexed words as well as advanced search operations, such as phrase searches and proximity searches. (An example of a phrase search is looking for “white elephant,” where “white” is followed by “elephant”. An example of a proximity search is looking for “big” and “house” where “big” occurs near “house”.) To prevent the full-text index from becoming bloated, noise words such as “a,” “and,” and “the” are ignored.
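The index structure described above, which records word positions per row key and skips noise words, can be sketched as an in-memory inverted index that supports phrase and proximity searches. This is an illustration of the general technique, not of the SQL Server index engine; the sample rows, the noise word list and the function names are hypothetical.

```python
import re
from collections import defaultdict

NOISE_WORDS = {"a", "and", "the"}

def build_index(rows):
    """Map each non-noise word to {row key: [word positions]}.

    Positions count every word, including noise words, so that
    adjacency in the original text is preserved.
    """
    index = defaultdict(lambda: defaultdict(list))
    for key, text in rows.items():
        for position, word in enumerate(re.findall(r"\w+", text.lower()), start=1):
            if word not in NOISE_WORDS:
                index[word][key].append(position)
    return index

def phrase_search(index, first, second):
    """Keys of rows where `first` is immediately followed by `second`."""
    hits = set()
    for key in index.get(first, {}).keys() & index.get(second, {}).keys():
        if any(q - p == 1 for p in index[first][key] for q in index[second][key]):
            hits.add(key)
    return hits

def proximity_search(index, first, second, distance):
    """Keys of rows where the two words occur within `distance` words."""
    hits = set()
    for key in index.get(first, {}).keys() & index.get(second, {}).keys():
        if any(abs(q - p) <= distance for p in index[first][key] for q in index[second][key]):
            hits.add(key)
    return hits

# Hypothetical rows keyed by a unique key column.
ROWS = {
    1: "The white elephant lives in a big old house",
    2: "An elephant painted white",
    3: "A big dog and a small house",
}
INDEX = build_index(ROWS)
```

As in the description above, each search returns only the key values of the matching rows, which the database can then use to fetch the rows themselves.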

Extensions to the Transact-SQL language are used to construct full-text queries. The two key predicates that are used in the NLQS are CONTAINS and FREETEXT.

The CONTAINS predicate is used to determine whether or not values in full-text registered columns contain certain words and phrases. Specifically, this predicate is used to search for:

-   A word or phrase.
-   The prefix of a word or phrase.
-   A word or phrase that is near another.
-   A word that is an inflectional form of another (for example, “drive” is the inflectional stem of “drives,” “drove,” “driving,” and “driven”).
-   A set of words or phrases, each of which is assigned a different weighting.

The relational engine within SQL Server recognizes the CONTAINS and FREETEXT predicates and performs some minimal syntax and semantic checking, such as ensuring that the column referenced in the predicate has been registered for full-text searches. During query execution, a full-text predicate and other relevant information are passed to the full-text search component. After further syntax and semantic validation, the search engine is invoked and returns the set of unique key values identifying those rows in the table that satisfy the full-text search condition. In addition to FREETEXT and CONTAINS, other predicates and operators such as AND, LIKE and NEAR are combined to create the customized NLQS SQL construct.

Full-Text Query Architecture of the SQL Database

The full-text query architecture comprises the following components: the Full-Text Query component, the SQL Server Relational Engine, the Full-Text Provider and the Search Engine.

The Full-Text Query component of the SQL database accepts a full-text predicate or rowset-valued function from the SQL Server, transforms parts of the predicate into an internal format, and sends it to the Search Service, which returns the matches in a rowset. The rowset is then sent back to SQL Server. SQL Server uses this information to create the resultset that is then returned to the submitter of the query.

The SQL Server Relational Engine accepts the CONTAINS and FREETEXT predicates as well as the CONTAINSTABLE( ) and FREETEXTTABLE( ) rowset-valued functions. During parse time, this code checks for conditions such as attempting to query a column that has not been registered for full-text search. If valid, then at run time, the ft_search_condition and context information are sent to the full-text provider. Eventually, the full-text provider returns a rowset to SQL Server, which is used in any joins (specified or implied) in the original query. The Full-Text Provider parses and validates the ft_search_condition, constructs the appropriate internal representation of the full-text search condition, and then passes it to the search engine. The result is returned to the relational engine by means of a rowset of rows that satisfy ft_search_condition.

Client Side System 150

The architecture of client-side system 150 of Natural Language Query System 100 is illustrated in greater detail in FIGS. 2A-2C. Referring to FIG. 2A, the three main processes effectuated by Client System 150 are illustrated as follows: Initialization process 200A consisting of SRE 201, Communication 202 and Microsoft (MS) Agent 203 routines; at FIG. 2B an iterative process 200B consisting of two sub-routines: a) Receive User Speech 208, made up of SRE 204 and Communication 205; and b) Receive Answer from Server 207, made up of MS Speak Agent 206, Communication 209, Voice data file 210 and Text to Speech Engine 211. Finally, in FIG. 2C un-initialization process 200C is made up of three sub-routines: SRE 212, Communication 213, and MS Agent 214. Each of the above three processes is described in detail in the following paragraphs. It will be appreciated by those skilled in the art that the particular implementation for such processes and routines will vary from client platform to platform, so that in some environments such processes may be embodied in hard-coded routines executed by a dedicated DSP, while in others they may be embodied as software routines executed by a shared host processor, and in still others a combination of the two may be used.

Initialization at Client System 150

The initialization of the Client System 150 is illustrated in FIG. 2D and generally comprises 3 separate initializing processes: client-side Speech Recognition Engine 220A, MS Agent 220B and Communication processes 220C.

Initialization of Speech Recognition Engine 220A

Speech Recognition Engine 155 is initialized and configured using the routines shown in 220A. First, an SRE COM Library is initialized. Next, memory 220 is allocated to hold Source and Coder objects, which are created by a routine 221. Loading of configuration file 221A from configuration data file 221B also takes place at the same time that the SRE Library is initialized. In configuration file 221B, the type of the input of the Coder and the type of the output of the Coder are declared. The structure, operation, etc. of such routines are well-known in the art, and they can be implemented using a number of fairly straightforward approaches. Accordingly, they are not discussed in detail herein. Next, Speech and Silence components of an utterance are calibrated using a routine 222, in a procedure that is also well-known in the art. To calibrate the speech and silence components, the user preferably articulates a sentence that is displayed in a text box on the screen. The SRE library then estimates the noise and other parameters required to find silence and speech elements of future user utterances.

Initialization of MS Agent 220B

The software code used to initialize and set up a MS Agent 220B is also illustrated in FIG. 2D. The MS Agent 220B routine is responsible for coordinating and handling the actions of the animated agent 157 (FIG. 1). This initialization thus consists of the following steps:

-   1. Initialize COM library 223. This part of the code initializes the COM library, which is required to use ActiveX Controls, which controls are well-known in the art.
-   2. Create instance of Agent Server 224: this part of the code creates an instance of Agent ActiveX control.
-   3. Loading of MS Agent 225: this part of the code loads MS Agent character from a specified file 225A containing general parameter data for the Agent Character, such as the overall appearance, shape, size, etc.
-   4. Get Character Interface 226: this part of the code gets an appropriate interface for the specified character; for example, characters may have different control/interaction capabilities that can be presented to the user.
-   5. Add Commands to Agent Character Option 227: this part of the code adds commands to an Agent Properties sheet, which sheet can be accessed by clicking on the icon that appears in the system tray when the Agent character is loaded, e.g., that the character can Speak, how he/she moves, TTS Properties, etc.
-   6. Show the Agent Character 228: this part of the code displays the Agent character on the screen so it can be seen by the user.
-   7. AgentNotifySink, to handle events. This part of the code creates AgentNotifySink object 229, registers it at 230 and then gets the Agent Properties interface 231. The property sheet for the Agent character is assigned using routine 232.
-   8. Do Character Animations 233: this part of the code plays specified character animations to welcome the user to NLQS 100.

The above then constitutes the entire sequence required to initialize the MS Agent. As with the SRE routines, the MS Agent routines can be implemented in any suitable and conventional fashion by those skilled in the art based on the present teachings. The particular structure, operation, etc. of such routines is not critical, and thus they are not discussed in detail herein.

In a preferred embodiment, the MS Agent is configured to have an appearance and capabilities that are appropriate for the particular application. For instance, in a remote learning application, the agent has the visual form and mannerisms/attitude/gestures of a college professor. Other visual props (blackboard, textbook, etc.) may be used by the agent and presented to the user to bring to mind the experience of being in an actual educational environment. The characteristics of the agent may be configured at the client side 150, and/or as part of code executed by a browser program (not shown) in response to configuration data and commands from a particular web page. For example, a particular website offering medical services may prefer to use a visual image of a doctor. These and many other variations will be apparent to those skilled in the art for enhancing the human-like, real-time dialog experience for users.

Initialization of Communication Link 160A

The initialization of Communication Link 160A is shown with reference to process 220C of FIG. 2D. Referring to FIG. 2D, this initialization consists of the following code components: Open INTERNET Connection 234, which opens an INTERNET connection and sets the parameters for the connection; Set Callback Status routine 235, which sets the callback status so as to inform the user of the status of the connection; and finally Start New HTTP INTERNET Session 236, which starts a new INTERNET session. The details of Communications Link 160 and the set up process 220C are not critical, and will vary from platform to platform. Again, in some cases, users may use a low-speed dial-up connection, a dedicated high speed switched connection (T1 for example), an always-on xDSL connection, a wireless connection, and the like.

Iterative Processing of Queries/Answers

As illustrated in FIG. 3, once initialization is complete, an iterative query/answer process is launched when the user presses the Start Button to initiate a query. Referring to FIG. 3, the iterative query/answer process consists of two main sub-processes implemented as routines on the client side system 150: Receive User Speech 240 and Receive User Answer 243. The Receive User Speech 240 routine receives speech from the user (or another audio input source), while the Receive User Answer 243 routine receives an answer to the user's question in the form of text from the server so that it can be converted to speech for the user by text-to-speech engine 159. As used herein, the term “query” is used in the broadest sense to refer to either a question, a command, or some form of input used as a control variable by the system. For example, a query may consist of a question directed to a particular topic, such as “what is a network” in the context of a remote learning application. In an e-commerce application a query might consist of a command to “list all books by Mark Twain” for example. Similarly, while the answer in a remote learning application consists of text that is rendered into audible form by the text to speech engine 159, it could also be returned as another form of multi-media information, such as a graphic image, a sound file, a video file, etc. depending on the requirements of the particular application. Again, given the present teachings concerning the necessary structure, operation, functions, performance, etc., of the client-side Receive User Speech 240 and Receive User Answer 243 routines, one of ordinary skill in the art could implement such in a variety of ways.

Receive User Speech: As illustrated in FIG. 3, the Receive User Speech routine 240 consists of a SRE 241 and a Communication 242 process, both implemented again as routines on the client side system 150 for receiving and partially processing the user's utterance. SRE routine 241 uses a coder 248 which is prepared so that a coder object receives speech data from a source object. Next the Start Source 249 routine is initiated. This part of the code initiates data retrieval using the Source object, which data will in turn be given to the Coder object. Next, MFCC vectors 250 are extracted from the speech utterance continuously until silence is detected. As alluded to earlier, this represents the first phase of processing of the input speech signal, and in a preferred embodiment, it is intentionally restricted to merely computing the MFCC vectors for the reasons already expressed above. These vectors include the 12 cepstral coefficients and the RMS energy term, for a total of 13 separate numerical values for the partially processed speech signal.

In some environments, nonetheless, it is conceivable that the MFCC delta parameters and MFCC acceleration parameters can also be computed at client side system 150, depending on the computation resources available, the transmission bandwidth in data link 160A available to server side system 180, the speed of a transceiver used for carrying data in the data link, etc. These parameters can be determined automatically by the client side system upon initializing SRE 155 (using some type of calibration routine to measure resources), or by direct user control, so that the partitioning of signal processing responsibilities can be optimized on a case-by-case basis. In some applications, too, server side system 180 may lack the appropriate resources or routines for completing the processing of the speech input signal. Therefore, for some applications, the allocation of signal processing responsibilities may be partitioned differently, to the point where in fact both phases of the speech signal processing may take place at client side system 150 so that the speech signal is completely, rather than partially, processed and transmitted for conversion into a query at server side system 180.

Again in a preferred embodiment, to ensure reasonable accuracy and real-time performance from a query/response perspective, sufficient resources are made available in a client side system so that 100 frames per second of speech data can be partially processed and transmitted through link 160A. Since the least amount of information that is necessary to complete the speech recognition process (only 13 coefficients) is sent, the system achieves a real-time performance that is believed to be highly optimized, because other latencies (i.e., client-side computational latencies, packet formation latencies, transmission latencies) are minimized. It will be apparent that the principles of the present invention can be extended to other SR applications where some other methodology is used for breaking down the speech input signal by an SRE (i.e., non-MFCC based). The only criterion is that the SR processing be similarly dividable into multiple phases, with the responsibility for different phases being handled on opposite sides of link 160A depending on overall system performance goals, requirements and the like. This functionality of the present invention can thus be achieved on a system-by-system basis, with an expected and typical amount of optimization being necessary for each particular implementation.
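The effect of sending only the 13 static coefficients can be made concrete with a short calculation, sketched here in Python. The frame rate of 100 frames per second and the counts of 13 and 39 coefficients are taken from the present description; the size of 4 bytes per coefficient is an assumption (32-bit values), and the figures ignore any encoding or protocol overhead.

```python
FRAMES_PER_SECOND = 100   # frame rate stated in the description
STATIC_COEFFS = 13        # 12 cepstral coefficients plus the RMS energy term
FULL_COEFFS = 39          # static, delta and acceleration coefficients together
BYTES_PER_COEFF = 4       # assumption: each coefficient is a 32-bit value

def payload_bytes_per_second(coeffs_per_frame):
    """Raw feature payload per second of speech, before any encoding overhead."""
    return FRAMES_PER_SECOND * coeffs_per_frame * BYTES_PER_COEFF

partial_rate = payload_bytes_per_second(STATIC_COEFFS)  # client sends statics only
full_rate = payload_bytes_per_second(FULL_COEFFS)       # client sends all 39 values
```

Under these assumptions the partially processed stream amounts to 5,200 bytes per second, one third of the 15,600 bytes per second that would be needed if the delta and acceleration coefficients were also computed and transmitted by the client.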

Thus, the present invention achieves a response rate performance that is tailored in accordance with the amount of information that is computed, coded and transmitted by the client side system 150. So in applications where real-time performance is most critical, the least possible amount of extracted speech data is transmitted to reduce these latencies, and, in other applications, the amount of extracted speech data that is processed, coded and transmitted can be varied.

Communication: transmit communication module 242 is used to implement the transport of data from the client to the server over the data link 160A, which in a preferred embodiment is the INTERNET. As explained above, the data consists of encoded MFCC vectors that will be used at the server-side of the Speech Recognition engine to complete the speech recognition decoding. The sequence of the communication is as follows:

OpenHTTPRequest 251: this part of the code first converts MFCC vectors to a stream of bytes, and then processes the bytes so that they are compatible with a protocol known as HTTP. This protocol is well-known in the art, and it is apparent that for other data links another suitable protocol would be used.

-   1. Encode MFCC Byte Stream 251: this part of the code encodes the MFCC vectors, so that they can be sent to the server via HTTP.
-   2. Send data 252: this part of the code sends MFCC vectors to the server using the INTERNET connection and the HTTP protocol.

Wait for the Server Response 253: this part of the code monitors the data link 160A until a response from server side system 180 arrives. In summary, the MFCC parameters are extracted or observed on-the-fly from the input speech signal. They are then encoded to an HTTP byte stream and sent in a streaming fashion to the server before the silence is detected, i.e., sent to server side system 180 before the utterance is complete. This aspect of the invention also facilitates a real-time behavior, since data can be transmitted and processed even while the user is still speaking.

Receive Answer from Server 243 comprises the following modules as shown in FIG. 3: MS Agent 244, Text-to-Speech Engine 245 and receive communication modules 246. All three modules interact to receive the answer from server side system 180. As illustrated in FIG. 3, the receive communication process consists of three separate processes implemented as a receive routine on client side system 150: a Receive the Best Answer routine 258 receives the best answer over data link 160B (the HTTP communication channel); the answer is de-compressed at 259; and the answer is then passed by code 260 to the MS Agent 244, where it is received by code portion 254. A routine 255 then articulates the answer using text-to-speech engine 257. Of course, the text can also be displayed for additional feedback purposes on a monitor used with client side system 150. The text to speech engine uses a natural language voice data file 256 associated with it that is appropriate for the particular language application (e.g., English, French, German, Japanese, etc.). As explained earlier, when the answer is something more than text, it can be treated as desired to provide responsive information to the user, such as with a graphics image, a sound, a video clip, etc.

Uninitialization

The un-initialization routines and processes are illustrated in FIG. 4. Three functional modules are used for un-initializing the primary components of the client side system 150; these include SRE 270, Communications 271 and MS Agent 272 un-initializing routines. To un-initialize SRE 220A, memory that was allocated in the initialization phase is de-allocated by code 273 and objects created during such initialization phase are deleted by code 274. Similarly, as illustrated in FIG. 4, to un-initialize Communications module 220C the INTERNET connection previously established with the server is closed by code portion 275 of the Communication Un-initialization routine 271. Next the INTERNET session created at the time of initialization is also closed by routine 276. For the un-initialization of the MS Agent 220B, as illustrated in FIG. 4, MS Agent Un-initialization routine 272 first releases the Commands Interface 227 using routine 277. This releases the commands added to the property sheet during loading of the agent character by routine 225. Next the Character Interface initialized by routine 226 is released by routine 278 and the Agent is unloaded at 279. The Sink Object Interface is then also released at 280, followed by the release of the Property Sheet Interface at 281. The Agent Notify Sink 282 then un-registers the Agent and finally the Agent Interface 283 is released, which releases all the resources allocated during the initialization steps identified in FIG. 2D.

It will be appreciated by those skilled in the art that the particular implementation for such un-initialization processes and routines in FIG. 4 will vary from client platform to client platform, as for the other routines discussed above. The structure, operation, etc. of such routines are well-known in the art, and they can be implemented using a number of fairly straightforward approaches without undue effort. Accordingly, they are not discussed in detail herein.

Description of Server Side System 180

Introduction

A high level flow diagram of the set of preferred processes implemented on server side system 180 of Natural Language Query System 100 is illustrated in FIGS. 11A through 11C. In a preferred embodiment, this process consists of a two step algorithm for completing the processing of the speech input signal, recognizing the meaning of the user's query, and retrieving an appropriate answer/response for such query.

The first step as illustrated in FIG. 11A can be considered a high-speed first-cut pruning mechanism, and includes the following operations: after completing processing of the speech input signal, the user's query is recognized at step 1101, so that the text of the query is simultaneously sent to Natural Language Engine 190 (FIG. 1) at step 1107, and to DB Engine 186 (also FIG. 1) at step 1102. By “recognized” in this context it is meant that the user's query is converted into a text string of distinct native language words through the HMM technique discussed earlier.

At NLE 190, the text string undergoes morphological linguistic processing at step 1108: the string is tokenized, the tokens are tagged and the tagged tokens are grouped. Next the noun phrases (NP) of the string are stored at 1109, and also copied and transferred for use by DB Engine 186 during a DB Process at step 1110. As illustrated in FIG. 11A, the string corresponding to the user's query which was sent to the DB Engine 186 at 1102 is used together with the NP received from NLE 190 to construct an SQL Query at step 1103. Next, the SQL query is executed at step 1104, and a record set of potential questions corresponding to the user's query is received as a result of a full-text search at 1105, which is then sent back to NLE 190 in the form of an array at step 1106.
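By way of illustration only, the construction of a full-text SQL query from the noun phrases at step 1103 might be sketched in Python as follows. The exact form of the customized NLQS SQL construct is not reproduced here; the function name build_fulltext_query, the table name Questions and the column name QuestionText are hypothetical, and the sketch merely joins the noun phrases into a search condition for the CONTAINS predicate, which is passed as a separate parameter rather than being spliced into the SQL text.

```python
def build_fulltext_query(noun_phrases, table="Questions", column="QuestionText"):
    """Return (sql, search_condition) for a CONTAINS full-text query.

    The table and column names are hypothetical placeholders; they must come
    from trusted configuration, never from user input.
    """
    terms = []
    for phrase in noun_phrases:
        # Drop double quotes so each phrase can be wrapped as a quoted term.
        cleaned = phrase.replace('"', " ").strip()
        if cleaned:
            terms.append(f'"{cleaned}"')
    if not terms:
        raise ValueError("at least one noun phrase is required")
    # Any row containing one of the noun phrases is a candidate question.
    search_condition = " OR ".join(terms)
    # The "?" is a DB-API style placeholder for the search condition.
    sql = f"SELECT * FROM {table} WHERE CONTAINS({column}, ?)"
    return sql, search_condition

sql, condition = build_fulltext_query(["network", "remote learning"])
```

In this sketch the resulting statement and search condition would be handed to whatever database interface executes the query at step 1104; the choice of OR between terms is an assumption that favors a broad candidate set for the later, more precise comparison.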

As can be seen from the above, this first step of the server side processing acts as an efficient and fast pruning mechanism so that the universe of potential “hits” corresponding to the user's actual query is narrowed down very quickly to a manageable set of likely candidates.

Referring to FIG. 11B, in contrast to the first step above, the second step can be considered as the more precise selection portion of the recognition process. It begins with linguistic processing of each of the stored questions in the array returned by the full-text search process as possible candidates representing the user's query. Processing of these stored questions continues in NLE 190 as follows: each question in the array of questions corresponding to the record set returned by the SQL full-text search undergoes morphological linguistic processing at step 1111: in this operation, a text string corresponding to the retrieved candidate question is tokenized, the tokens are tagged and the tagged tokens are grouped. Next, noun phrases of the string are computed and stored at step 1112. This process continues iteratively at point 1113, and the sequence of steps at 1118, 1111, 1112, 1113 is repeated so that an NP for each retrieved candidate question is computed and stored. Once an NP is computed for each of the retrieved candidate questions of the array, a comparison is made between each such retrieved candidate question and the user's query based on the magnitude of the NP value at step 1114. This process is also iterative in that steps 1114, 1115, 1116, 1119 are repeated so that the comparison of the NP for each retrieved candidate question with that of the NP of the user's query is completed. When there are no more stored questions in the array to be processed at step 1117, the stored question that has the maximum NP relative to the user's query is identified at 1117A as the stored question which best matches the user's query.
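The iterative comparison of the second step can be sketched in Python as shown below. The precise measure used for the “magnitude of the NP value” is not spelled out in this passage, so the sketch assumes, purely for illustration, that the score is the number of noun phrases a candidate question shares with the user's query; the function names np_score and best_match and the sample data are likewise hypothetical, and the noun phrases are taken as already computed.

```python
def np_score(query_nps, candidate_nps):
    """Assumed scoring: count of noun phrases shared with the user's query."""
    return len(set(query_nps) & set(candidate_nps))

def best_match(query_nps, candidates):
    """candidates is a list of (question_text, noun_phrases) pairs.

    Returns the stored question with the maximum score, or None if the
    candidate array is empty. Ties keep the earliest candidate.
    """
    best_question, best_score = None, -1
    for question, candidate_nps in candidates:
        score = np_score(query_nps, candidate_nps)
        if score > best_score:
            best_question, best_score = question, score
    return best_question

# Hypothetical candidate array returned by the full-text search.
candidates = [
    ("what is a modem", ["modem"]),
    ("what is a local area network", ["local area network", "network"]),
    ("how does a network router work", ["network", "router"]),
]
matched = best_match(["network", "local area network"], candidates)
```

Because the full-text search has already reduced the candidates to a short array, a simple linear pass of this kind over the array is sufficient.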

Notably, it can be seen that the second step of the recognition process is much more computationally intensive than the first step above, because several text strings are tokenized, and a comparison is made of several NPs. This would not be practical, however, were it not for the fact that the first step has already quickly and efficiently reduced the candidates to be evaluated to a significant degree. This more computationally intensive aspect of the present invention is nonetheless extremely valuable, because it yields extremely high accuracy in the overall query recognition process. In this regard, therefore, this second step of the query recognition helps to ensure the overall accuracy of the system, while the first step helps to maintain a satisfactory speed that provides a real-time feel for the user.

As illustrated in FIG. 11C, the last part of the query/response process occurs by providing an appropriate matching answer/response to the user. Thus, an identity of a matching stored question is completed at step 1120. Next a file path corresponding to an answer of the identified matching question is extracted at step 1121. Processing continues so that the answer is extracted from the file path at 1122 and finally the answer is compressed and sent to client side system 150 at step 1123.

The discussion above is intended to convey a general overview of the primary components, operations, functions and characteristics of those portions of NLQS system 100 that reside on server side system 180. The discussion that follows describes in more detail the respective sub-systems.
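Steps 1122 and 1123, together with the de-compression performed at the client, can be sketched in Python as follows. The compression scheme actually used is not identified in this passage; the sketch assumes DEFLATE compression through the standard zlib module and UTF-8 text, and the function names are hypothetical.

```python
import zlib
from pathlib import Path

def load_and_compress_answer(answer_path):
    """Read the answer text from its file path and compress it for sending."""
    text = Path(answer_path).read_text(encoding="utf-8")
    # Assumption: zlib (DEFLATE) stands in for the unspecified compression.
    return zlib.compress(text.encode("utf-8"))

def decompress_answer(payload):
    """Client-side counterpart: recover the answer text from the payload."""
    return zlib.decompress(payload).decode("utf-8")
```

The server would call load_and_compress_answer with the file path extracted at step 1121, and the client would apply decompress_answer to the received bytes before handing the text to the agent for articulation.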

Software Modules Used in Server Side System 180

The key software modules used on server-side system 180 of the NLQS system are illustrated in FIG. 5. These include generally the following components: a Communication module 500, identified as CommunicationServerISAPI 500A (which is executed by SRE Server-side 182 of FIG. 1 and is explained in more detail below); a database process DBProcess module 501 (executed by DB Engine 186 of FIG. 1); a Natural Language Engine module 500C (executed by NLE 190 of FIG. 1); and an interface 500B between the NLE process module 500C and the DBProcess module 501. As shown here, CommunicationServerISAPI 500A includes a server-side speech recognition engine and appropriate communication interfaces required between client side system 150 and server side system 180. As further illustrated in FIG. 5, server-side logic of Natural Language Query System 100 also can be characterized as including two dynamic link library components: CommunicationServerISAPI 500 and DBProcess 501. The CommunicationServerISAPI 500 comprises 3 sub-modules: Server-side Speech Recognition Engine module 500A; Interface module 500B between Natural Language Engine modules 500C and DBProcess 501; and the Natural Language Engine modules 500C.

DB Process 501 is a module whose primary function is to connect to a SQL database and to execute an SQL query that is composed in response to the user's query. In addition, this module interfaces with logic that fetches the correct answer from a file path once this answer is passed to it from the Natural Language Engine module 500C.

Speech Recognition Sub-System 182 on Server-Side System 180

The server side speech recognition engine module 500A is a set of distributed components that perform the necessary functions and operations of speech recognition engine 182 (FIG. 1) at server-side 180. These components can be implemented as software routines that are executed by server side 180 in conventional fashion. Referring to FIG. 4A, a more detailed break out of the operation of the speech recognition components 600 at the server-side can be seen as follows:

Within a portion 601 of the server side SRE module 500A, the binary MFCC vector byte stream corresponding to the speech signal's acoustic features extracted at client side system 150 and sent over the communication channel 160 is received. The MFCC acoustic vectors are decoded from the encoded HTTP byte stream as follows: Since the MFCC vectors contain embedded NULL characters, they cannot be transferred in this form to server side system 180 as such using HTTP protocol. Thus the MFCC vectors are first encoded at client-side 150 before transmission in such a way that all the speech data is converted into a stream of bytes without embedded NULL characters in the data. At the very end of the byte stream a single NULL character is introduced to indicate the termination of the stream of bytes to be transferred to the server over the INTERNET 160A using HTTP protocol.
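One way to obtain a byte stream that is free of embedded NULL characters and is terminated by a single NULL is byte escaping, sketched below in Python. The particular encoding used in the preferred embodiment is not specified in this passage; the escape scheme, the constant ESC, the use of 32-bit little-endian floats and the function names encode_vectors and decode_vectors are all assumptions made for illustration.

```python
import struct

ESC = 0x01  # assumed escape byte; not specified in the description

def encode_vectors(vectors):
    """Pack MFCC vectors as 32-bit floats, escape NULL bytes, append one NULL."""
    raw = b"".join(struct.pack(f"<{len(v)}f", *v) for v in vectors)
    out = bytearray()
    for byte in raw:
        if byte == 0x00:
            out += bytes([ESC, 0x02])   # escaped NULL
        elif byte == ESC:
            out += bytes([ESC, 0x03])   # escaped escape byte
        else:
            out.append(byte)
    out.append(0x00)                    # single terminating NULL
    return bytes(out)

def decode_vectors(stream, coeffs_per_vector=13):
    """Reverse encode_vectors: read up to the terminating NULL and unescape."""
    body = stream[:stream.index(0x00)]
    raw = bytearray()
    iterator = iter(body)
    for byte in iterator:
        if byte == ESC:
            raw.append(0x00 if next(iterator) == 0x02 else ESC)
        else:
            raw.append(byte)
    frame_size = 4 * coeffs_per_vector  # 4 bytes per 32-bit float
    return [list(struct.unpack(f"<{coeffs_per_vector}f", raw[i:i + frame_size]))
            for i in range(0, len(raw), frame_size)]
```

With such a scheme the only NULL byte in the transmitted stream is the terminator, so the server can locate the end of the stream unambiguously and then recover the original 13-coefficient vectors.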

As explained earlier, to reduce latency between the client and server, a smaller number of bytes (just the 13 MFCC coefficients) are sent from client side system 150 to server side system 180. This is done automatically for each platform to ensure uniformity, or can be tailored by the particular application environment, e.g., where it is determined that it will take less time to compute the delta and acceleration coefficients at the server (26 more calculations) than it would take to encode them at the client, transmit them, and then decode them from the HTTP stream. In general, since server side system 180 is usually better equipped to calculate the MFCC delta and acceleration parameters, this is a preferable choice. Furthermore, there is generally more control over server resources compared to the client's resources, which means that future upgrades, optimizations, etc., can be disseminated and shared by all to make overall system performance more reliable and predictable. So, the present invention can accommodate even the worst-case scenario where the client's machine may be quite thin and may just have enough resources to capture the speech input data and do minimal processing.

Dictionary Preparation & Grammar Files

Referring to FIG. 4A, within code block 605, various options selected by the user (or gleaned from the user's status within a particular application) are received. For instance, in the case of a preferred remote learning system, Course, Chapter and/or Section data items are communicated. In the case of other applications (such as e-commerce) other data options are communicated, such as the Product Class, Product Category, Product Brand, etc. loaded for viewing within his/her browser. These selected options are based on the context experienced by the user during an interactive process, and thus help to limit and define the scope, i.e., the grammars and dictionaries that will be dynamically loaded to speech recognition engine 182 (FIG. 1) for Viterbi decoding during processing of the user speech utterance. For speech recognition to be optimized, both grammar and dictionary files are used in a preferred embodiment. A Grammar file supplies the universe of available user queries; i.e., all the possible words that are to be recognized. The Dictionary file provides phonemes (the information of how a word is pronounced, which depends on the specific native language files that are installed, for example, UK English or US English) of each word contained in the grammar file. It is apparent that if all the sentences for a given environment that can be recognized were contained in a single grammar file then recognition accuracy would deteriorate and the loading time alone for such grammar and dictionary files would impair the speed of the speech recognition process.

To avoid these problems, specific grammars are dynamically loaded or actively configured as the current grammar according to the user's context, i.e., as in the case of a remote learning system, the Course, Chapter and/or Section selected. Thus the grammar and dictionary files are loaded dynamically according to the given Course, Chapter and/or Section as dictated by the user, or as determined automatically by an application program executed by the user.
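The context-dependent selection of grammar and dictionary files can be sketched in Python as follows. The class name GrammarLoader, the lookup table and the file names are hypothetical; the sketch only shows the idea of keying the active grammar and dictionary on the Course, Chapter and Section and reloading only when that context changes.

```python
class GrammarLoader:
    """Tracks which grammar/dictionary pair is active for the user's context."""

    def __init__(self, table):
        # table maps (course, chapter, section) -> (grammar_file, dictionary_file)
        self.table = table
        self.current_context = None
        self.load_count = 0  # counts how many times a new pair was activated

    def activate(self, course, chapter, section):
        context = (course, chapter, section)
        if context not in self.table:
            raise KeyError(f"no grammar registered for context {context}")
        if context != self.current_context:
            # A real system would load the files into the decoder here.
            self.current_context = context
            self.load_count += 1
        return self.table[context]

# Hypothetical contexts and file names for a remote learning application.
loader = GrammarLoader({
    ("networking", "chapter1", "section1"): ("net_c1_s1.grm", "net_c1_s1.dct"),
    ("networking", "chapter1", "section2"): ("net_c1_s2.grm", "net_c1_s2.dct"),
})
active = loader.activate("networking", "chapter1", "section1")
```

Keeping each grammar small and tied to a narrow context reflects the rationale stated above: a single all-encompassing grammar would both slow loading and reduce recognition accuracy.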

The second code block 602 implements the initialization of Speech Recognition engine 182 (FIG. 1). The MFCC vectors received from client side system 150 along with the grammar filename and the dictionary filenames are introduced to this block to initialize the speech decoder.

As illustrated in FIG. 4A, the initialization process 602 uses the following sub-routines: A routine 602 a for loading an SRE library. This then allows the creation of an object identified as External Source with code 602 b using the received MFCC vectors. Code 602 c allocates memory to hold the recognition objects. Routine 602 d then also creates and initializes objects that are required for the recognition, such as: Source, Coder, Recognizer and Results; loading of the Dictionary created by code 602 e; Hidden Markov Models (HMMs) generated with code 602 f; and loading of the Grammar file generated by routine 602 g.

Speech Recognition 603 is the next routine invoked as illustrated in FIG. 4A, and is generally responsible for completing the processing of the user speech signals input on the client side 150, which, as mentioned above, are preferably only partially processed (i.e., only MFCC vectors are computed during the first phase) when they are transmitted across link 160. Using the functions created in External Source by subroutine 602 b, this code reads MFCC vectors, one at a time, from an External Source 603 a, and processes them in block 603 b to realize the words in the speech pattern that are symbolized by the MFCC vectors captured at the client. During this second phase, an additional 13 delta coefficients and an additional 13 acceleration coefficients are computed as part of the recognition process, so that each observation vector O_(t) referred to earlier holds a total of 39 coefficients. Then, using a set of previously defined Hidden Markov Models (HMMs), the words corresponding to the user's speech utterance are determined in the manner described earlier. This completes the word “recognition” aspect of the query processing, the results of which are used further below to complete the query processing operations.
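As a rough illustration of the second-phase computation, the sketch below appends 13 delta and 13 acceleration coefficients to each 13-coefficient MFCC frame. The regression formula and the window of two frames are assumptions (a common convention in HMM-based recognizers); the specification does not state which formula is used, and the function names are hypothetical.

```python
def deltas(frames, window=2):
    """Regression deltas: d_t = sum_n n*(c[t+n] - c[t-n]) / (2 * sum_n n^2).

    Frames beyond either end of the utterance are replaced by the
    first or last frame (an assumed edge-handling rule).
    """
    denom = 2 * sum(n * n for n in range(1, window + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for k in range(len(frames[t])):
            num = sum(
                n * (frames[min(t + n, last)][k] - frames[max(t - n, 0)][k])
                for n in range(1, window + 1)
            )
            row.append(num / denom)
        out.append(row)
    return out


def observation_vectors(mfcc):
    """Concatenate static, delta and acceleration coefficients per frame."""
    d = deltas(mfcc)      # first-order (delta) coefficients
    a = deltas(d)         # second-order (acceleration) coefficients
    return [c + dd + aa for c, dd, aa in zip(mfcc, d, a)]
```

With 13 static coefficients per frame, each returned vector has 13 + 13 + 13 = 39 entries.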

It will be appreciated by those skilled in the art that the distributed nature and rapid performance of the word recognition process, by itself, is extremely useful and may be implemented in connection with other environments that do not implicate or require additional query processing operations. For example, some applications may simply use individual recognized words for filling in data items on a computer generated form, and the aforementioned systems and processes can provide a rapid, reliable mechanism for doing so.

Once the user's speech is recognized, the flow of SRE 182 passes to Un-initialize SRE routine 604, where the speech engine is un-initialized as illustrated. In this block all the objects created in the initialization block are deleted by routine 604 a, and the memory allocated during the initialization phase is released by routine 604 b.

Again, it should be emphasized that the above are merely illustrative of embodiments for implementing the particular routines used on a server side speech recognition system of the present invention. Other variations of the same that achieve the desired functionality and objectives of the present invention will be apparent from the present teachings.

Database Processor 186 Operation—DBProcess

Construction of an SQL Query used as part of the user query processing is illustrated in FIG. 4B. A SELECT SQL statement is preferably constructed using a conventional CONTAINS predicate. Module 950 constructs the SQL query based on this SELECT SQL statement, which query is used for retrieving the best suitable question stored in the database corresponding to the user's articulated query (designated as Question here). A routine 951 then concatenates a table name with the constructed SELECT statement. Next, the number of words present in each Noun Phrase (NP) of the Question asked by the user is calculated by routine 952. Then memory is allocated by routine 953 as needed to accommodate all the words present in the NP. Next, a word list (identifying all the distinct words present in the NP) is obtained by routine 954. After this, this set of distinct words is concatenated by routine 955 to the SQL Query, separated with a NEAR ( ) keyword. Next, the AND keyword is concatenated to the SQL Query by routine 956 after each NP. Finally, memory resources are freed by code 957 so that memory can be allocated to store the words received from the NP in any next iteration. Thus, at the end of this process, a completed SQL Query corresponding to the user's articulated question is generated.
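The query construction of FIG. 4B can be sketched as follows. This is an illustrative approximation only: the function name build_query, the use of SELECT *, the default column name, and the parenthesized grouping of each NP are assumptions, and the generated string merely approximates SQL Server full-text CONTAINS syntax without having been checked against a live server.

```python
def build_query(table, noun_phrases, column="PairedQuestion"):
    """Join the distinct words of each NP with NEAR, and the NPs with AND."""
    groups = []
    for phrase in noun_phrases:
        words = []
        for raw in phrase.split():
            # Keep letters and digits only, so no quote can leak into the SQL.
            word = "".join(ch for ch in raw.lower() if ch.isalnum())
            if word and word not in words:  # distinct words only
                words.append(word)
        if words:
            groups.append("(" + " NEAR ".join(f'"{w}"' for w in words) + ")")
    condition = " AND ".join(groups)
    return f"SELECT * FROM {table} WHERE CONTAINS({column}, '{condition}')"
```

In a production setting the table name would also need validation, and the search condition would normally be passed as a bound parameter rather than interpolated into the statement.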

Connection to SQL Server: As illustrated in FIG. 4C, after the SQL Query is constructed by routine 710, a routine 711 implements a connection to the query database 717 to continue processing of the user query. The connection sequence and the subsequent retrieved record set are implemented using routines 700, which include the following:

1. Server and database names are assigned by routine 711A to a DBProcess member variable;
2. A connection string is established by routine 711B;
3. The SQL Server database is connected under control of code 711C;
4. The SQL Query is received by routine 712A;
5. The SQL Query is executed by code 712B;
6. The total number of records retrieved by the query is extracted (713);
7. Memory is allocated to store the total number of paired questions (713);
8. The entire number of paired questions is stored into an array (713).

Once the Best Answer ID is received at 716 (FIG. 4C) from the NLE 14 (FIG. 5), the code corresponding to 716A receives it and passes it to code in 716B, where the path of the Answer file is determined using the record number. Then the file is opened by 716C using the path passed to it, and the contents of the file corresponding to the answer are read. Then the answer is compressed by code in 716D and prepared for transmission over the communication channel 160B (FIG. 1).

NLQS Database 188—Table Organization

FIG. 6 illustrates a preferred embodiment of a logical structure of tables used in a typical NLQS database 188 (FIG. 1). When NLQS database 188 is used as part of NLQS query system 100 implemented as a remote learning/training environment, this database will include an organizational multi-level hierarchy that consists typically of a Course 701, which is made of several chapters 702, 703, 704. Each of these chapters can have one or more Sections 705, 706, 707 as shown for Chapter 1. A similar structure can exist for Chapter 2, Chapter 3 . . . Chapter N. Each section has a set of one or more question-answer pairs 708 stored in tables described in more detail below. While this is an appropriate and preferable arrangement for a training/learning application, it is apparent that other implementations would be possible and perhaps more suitable for other applications such as e-commerce, e-support, INTERNET browsing, etc., depending on overall system parameters.

It can be seen that the NLQS database 188 organization is intricately linked to the switched grammar architecture described earlier. In other words, the context (or environment) experienced by the user can be determined at any moment in time based on the selection made at the section level, so that only a limited subset of question-answer pairs 708, for example, are appropriate for section 705. This in turn means that only a particular appropriate grammar for such question-answer pairs may be switched in for handling user queries while the user is experiencing such context. In a similar fashion, an e-commerce application for an INTERNET based business may consist of a hierarchy that includes a first level “home” page 701 identifying user selectable options (product types, services, contact information, etc.), a second level that may include one or more “product types” pages 702, 703, 704, and a third level that may include particular product models 705, 706, 707, etc., with appropriate question-answer pairs 708 and grammars customized for handling queries for such product models. Again, the particular implementation will vary from application to application, depending on the needs and desires of such business, and a typical amount of routine optimization will be necessary for each such application.

Table Organization

In a preferred embodiment, an independent database is used for each Course. Each database in turn can include three types of tables as follows: a Master Table as illustrated in FIG. 7A, at least one Chapter Table as illustrated in FIG. 7B and at least one Section Table as illustrated in FIG. 7D.

As illustrated in FIG. 7A, a preferred embodiment of a Master Table has six columns: Field Name 701A, Data Type 702A, Size 703A, Null 704A, Primary Key 705A and Indexed 706A. These parameters are well-known in the art of database design and structure. The Master Table has only two fields: Chapter Name 707A and Section Name 708A. Both Chapter Name and Section Name are commonly indexed.

A preferred embodiment of a Chapter Table is illustrated in FIG. 7B. As with the Master Table, the Chapter Table has six (6) columns: Field Name 720, Data Type 721, Size 722, Null 723, Primary Key 724 and Indexed 725. In this case, however, there are nine (9) rows of data: Chapter_ID 726, Answer_ID 727, Section Name 728, Answer_Title 729, PairedQuestion 730, AnswerPath 731, Creator 732, Date of Creation 733 and Date of Modification 734.

An explanation of the Chapter Table fields is provided in FIG. 7C. Each of the eight (8) Fields 720 has a description 735 and stores data corresponding to:

- AnswerID 727: an integer that is automatically incremented for each answer given for user convenience
- Section_Name 728: the name of the section to which the particular record belongs. This field along with the AnswerID is used as the primary key
- Answer_Title 729: a short description of the title of the answer to the user query
- PairedQuestion 730: contains one or more combinations of questions for the related answers whose path is stored in the next column AnswerPath
- AnswerPath 731: contains the path of a file, which contains the answer to the related questions stored in the previous column; in the case of a pure question/answer application, this file is a text file, but, as mentioned above, could be a multi-media file of any kind transportable over the data link 160
- Creator 732: name of content creator
- Date_of_Creation 733: date on which content was created
- Date of Modification 734: date on which content was changed or modified
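For orientation, the Chapter Table fields listed above might be declared as in the sketch below, using Python's built-in sqlite3 module purely for illustration. The data types, the table name Chapter1, the helper names and the sample values are all assumptions, since the actual types and sizes appear only in FIG. 7B. SQLite also cannot auto-increment one column of a composite key, so the sketch supplies Answer_ID explicitly.

```python
import sqlite3

# Column types are assumed; only the field names and the composite
# primary key (Section_Name plus Answer_ID) come from the description.
CHAPTER_SCHEMA = """
CREATE TABLE Chapter1 (
    Chapter_ID           INTEGER NOT NULL,
    Answer_ID            INTEGER NOT NULL,
    Section_Name         TEXT    NOT NULL,
    Answer_Title         TEXT,
    PairedQuestion       TEXT,
    AnswerPath           TEXT,
    Creator              TEXT,
    Date_of_Creation     TEXT,
    Date_of_Modification TEXT,
    PRIMARY KEY (Section_Name, Answer_ID)
)
"""


def create_chapter_table(conn):
    conn.execute(CHAPTER_SCHEMA)


def add_answer(conn, section_name, answer_id, title, question, path, creator, date):
    """Insert one question/answer record; creation and modification dates start equal."""
    conn.execute(
        "INSERT INTO Chapter1 VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (1, answer_id, section_name, title, question, path, creator, date, date),
    )
```

The composite key means that the same Answer_ID may recur in different sections, while a duplicate within one section is rejected.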

A preferred embodiment of a Section Table is illustrated in FIG. 7D. The Section Table has six (6) columns: Field Name 740, Data Type 741, Size 742, Null 743, Primary Key 744 and Indexed 745. There are seven (7) rows of data: Answer_ID 746, Answer_Title 747, PairedQuestion 748, AnswerPath 749, Creator 750, Date of Creation 751 and Date of Modification 752. These names correspond to the same fields and columns already described above for the Master Table and Chapter Table.

Again, this is a preferred approach for the specific type of learning/training application described herein. Since the number of potential applications for the present invention is quite large, and each application can be customized, it is expected that other applications (including other learning/training applications) will require and/or be better accommodated by another table, column, and field structure/hierarchy.

Search Service and Search Engine: A query text search service is performed by an SQL Search System 1000 shown in FIG. 10. This system provides querying support to process full-text searches. This is where full-text indexes reside.

In general, SQL Search System determines which entries in a database index meet selection criteria specified by a particular text query that is constructed in accordance with an articulated user speech utterance. The Index Engine 1011B is the entity that populates the Full-Text Index tables with indexes which correspond to the indexable units of text for the stored questions and corresponding answers. It scans through character strings, determines word boundaries, removes all noise words and then populates the full-text index with the remaining words. For each entry in the full text database that meets the selection criteria, a unique key column value and a ranking value are returned as well. Catalog set 1013 is a file-system directory that is accessible only by an Administrator and Search Service 1010. Full-text indexes 1014 are organized into full-text catalogs, which are referenced by easy to handle names. Typically, full-text index data for an entire database is placed into a single full-text catalog.
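The behavior attributed to the Index Engine (scanning strings, finding word boundaries, dropping noise words, and populating an index that later yields a key and a rank per match) can be illustrated with a toy inverted index. The noise-word list, the function names, and the ranking rule (a simple count of matched words) are assumptions made for this sketch; they are not the ranking used by any particular SQL search service.

```python
import re
from collections import defaultdict

# Assumed noise-word list, for illustration only.
NOISE_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to", "what"}


def build_index(rows):
    """Map each remaining word to the set of row keys that contain it."""
    index = defaultdict(set)
    for key, text in rows.items():
        for word in re.findall(r"[A-Za-z0-9]+", text.lower()):  # word boundaries
            if word not in NOISE_WORDS:
                index[word].add(key)
    return index


def search(index, words):
    """Return (key, rank) pairs, highest rank first; rank = matched word count."""
    hits = defaultdict(int)
    for word in words:
        for key in index.get(word.lower(), ()):
            hits[key] += 1
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
```

Only keys and ranks come back from the index; the row contents themselves stay in the relational tables.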

The schema for the full-text database as described (FIG. 7, FIG. 7A, FIG. 7B, FIG. 7C, FIG. 7D) is stored in the tables 1006 shown in FIG. 10. Take, for example, the tables required to describe the structure of the stored question/answer pairs required for a particular course. For each table (Course Table, Chapter Table, Section Table), there are fields, i.e., column information that defines each parameter making up the logical structure of the table. This information is stored in User and System tables 1006. The key values corresponding to those tables are stored as Full-Text catalogs 1013. So when processing a full-text search, the search engine returns to the SQL Server the key values of the rows that match the search criteria. The relational engine then uses this information to respond to the query.

As illustrated in FIG. 10, a Full-Text Query Process is implemented as follows:

1. A query 1001 that uses a SQL full-text construct generated by DB processor 186 is submitted to SQL Relational Engine 1002.
2. Queries containing either a CONTAINS or FREETEXT predicate are rewritten by routine 1003 so that a responsive rowset returned later from Full-Text Provider 1007 will be automatically joined to the table that the predicate is acting upon. This rewrite is a mechanism used to ensure that these predicates are a seamless extension to a traditional SQL Server. After the compiled query is internally rewritten and checked for correctness in item 1003, the query is passed to RUN TIME module 1004. The function of module 1004 is to convert the rewritten SQL construct to a validated run-time process before it is sent to the Full-Text Provider 1007.
3. After this, Full-Text Provider 1007 is invoked, passing the following information for the query:
    a. A ft_search_condition parameter (this is a logical flag indicating a full text search condition)
    b. A name of a full-text catalog where a full-text index of a table resides
    c. A locale ID to be used for language (for example, word breaking)
    d. Identities of a database, table, and column to be used in the query
    e. If the query is comprised of more than one full-text construct, Full-Text Provider 1007 is invoked separately for each construct.
4. SQL Relational Engine 1002 does not examine the contents of ft_search_condition. Instead, this information is passed along to Full-Text Provider 1007, which verifies the validity of the query and then creates an appropriate internal representation of the full-text search condition.
5. The query request/command 1008 is then passed to Querying Support 1011A.
6. Querying Support 1012 returns a rowset 1009 from Full-Text Catalog 1013 that contains unique key column values for any rows that match the full-text search criteria. A rank value also is returned for each row.
7. The rowset of key column values 1009 is passed to SQL Relational Engine 1002. If processing of the query implicates either a CONTAINSTABLE( ) or FREETEXTTABLE( ) function, RANK values are returned; otherwise, any rank value is filtered out.
8. The rowset values 1009 are plugged into the initial query with values obtained from relational database 1006, and a result set 1015 is then returned for further processing to yield a response to the user.
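Steps 6 through 8 above amount to joining a rowset of (key, rank) pairs back to the relational table and dropping the rank unless it was requested. A minimal sketch follows; the function name join_rowset, the keep_rank flag and the dictionary-based table are hypothetical stand-ins for the relational engine, not its actual interface.

```python
def join_rowset(rowset, table, keep_rank=False):
    """Join (key, rank) pairs from the full-text search to the table rows.

    rowset    -- list of (unique key, rank) pairs, as in rowset 1009
    table     -- mapping from unique key to a dict of column values
    keep_rank -- True models a CONTAINSTABLE/FREETEXTTABLE style query,
                 where RANK is returned; otherwise rank is filtered out
    """
    results = []
    for key, rank in rowset:
        row = dict(table[key])      # copy the stored column values
        row["Answer_ID"] = key
        if keep_rank:
            row["RANK"] = rank
        results.append(row)
    return results
```

The returned list plays the role of result set 1015, i.e., the candidate questions handed on for noun-phrase comparison.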

At this stage of the query recognition process, the speech utterance by the user has already been rapidly converted into a carefully crafted text query, and this text query has been initially processed so that an initial matching set of results can be further evaluated for a final determination of the appropriate matching question/answer pair. The underlying principle that makes this possible is the presence of a full-text unique key column for each table that is registered for full-text searches. Thus when processing a full-text search, SQL Search Service 1010 returns to SQL server 1002 the key values of the rows that match the search criteria. In maintaining these full-text databases 1013 and full text indexes 1014, the present invention has the unique characteristic that the full-text indices 1014 are not updated instantly when the full-text registered columns are updated. This operation is eliminated, again, to reduce recognition latency, increase response speed, etc. Thus, as compared to other database architectures, this updating of the full-text index tables, which would otherwise take a significant time, is instead done asynchronously at a more convenient time.

Interface Between NLE 190 and DB Processor 186

The result set 1015 of candidate questions corresponding to the user query utterance is presented to NLE 190 for further processing as shown in FIG. 4D to determine a “best” matching question/answer pair. An NLE/DBProcessor interface module coordinates the handling of user queries, analysis of noun-phrases (NPs) of retrieved question sets from the SQL query based on the user query, comparing the retrieved question NPs with the user query NP, etc. between NLE 190 and DB Processor 186. So, this part of the server side code contains functions which interface processes resident in both NLE block 190 and DB Processor block 186. The functions are illustrated in FIG. 4D. As seen here, code routine 880 implements functions to extract the Noun Phrase (NP) list from the user's question. This part of the code interacts with NLE 190 and gets the list of Noun Phrases in a sentence articulated by the user. Similarly, Routine 813 retrieves an NP list from the list of corresponding candidate/paired questions 1015 and stores these questions into an array (ranked by NP value). Thus, at this point, NP data has been generated for the user query, as well as for the candidate questions 1015. As an example of determining the noun phrases of a sentence such as: “What issues have guided the President in considering the impact of foreign trade policy on American businesses?” NLE 190 would return the following as noun phrases: President, issues, impact of foreign trade policy, American businesses, impact, impact of foreign trade, foreign trade, foreign trade policy, trade, trade policy, policy, businesses. The methodology used by NLE 190 will thus be apparent to those skilled in the art from this set of noun phrases and noun sub-phrases generated in response to the example query.

Next, a function identified as Get Best Answer ID 815 is implemented. This part of the code gets a best answer ID corresponding to the user's query. To do this, routines 813A, 813B first find out the number of Noun phrases for each entry in the retrieved set 1015 that match with the Noun phrases in the user's query. Then routine 815 a selects a final result record from the candidate retrieved set 1015 that contains the maximum number of matching Noun phrases.
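The selection performed by Get Best Answer ID can be sketched as a count of shared noun phrases followed by a maximum. The function name, the case-insensitive comparison and the rule that the first candidate wins a tie are assumptions of this sketch; the specification only states that the record with the maximum number of matching noun phrases is selected.

```python
def best_answer_id(query_nps, candidates):
    """Return the answer ID whose noun phrases overlap most with the query.

    query_nps  -- noun phrases extracted from the user's question
    candidates -- list of (answer_id, noun_phrases) for the retrieved set
    """
    query = {phrase.lower() for phrase in query_nps}
    best_id, best_count = None, -1
    for answer_id, phrases in candidates:
        count = len(query & {phrase.lower() for phrase in phrases})
        if count > best_count:  # strict comparison: first candidate wins ties
            best_id, best_count = answer_id, count
    return best_id
```

An empty candidate list yields None, which a caller could treat as "no sufficiently close question/answer pair".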

Conventionally, nouns are commonly thought of as “naming” words, and specifically as the names of “people, places, or things”. Nouns such as John, London, and computer certainly fit this description, but the range of words classified by the present invention as nouns is much broader than this. Nouns can also denote abstract and intangible concepts such as birth, happiness, evolution, technology, management, imagination, revenge, politics, hope, cookery, sport, and literacy. Because of the enormous diversity of nouns compared to other parts of speech, the Applicant has found that it is much more relevant to consider the noun phrase as a key linguistic metric. So, the great variety of items classified as nouns by the present invention helps to discriminate and identify individual speech utterances much more easily and quickly than prior techniques disclosed in the art.

Following this same thought, the present invention also adopts and implements another linguistic entity, the word phrase, to facilitate speech query recognition. The basic structure of a word phrase, whether it be a noun phrase, verb phrase or adjective phrase, has three parts: [pre-Head string], [Head] and [post-Head string]. For example, in the minimal noun phrase “the children,” “children” is classified as the Head of the noun phrase. In summary, because of the diversity and frequency of noun phrases, the choice of the noun phrase as the metric by which a stored answer is linguistically chosen has a solid justification, both in applying this technique to the English natural language and to other natural languages. So, in sum, the total noun phrases in a speech utterance taken together operate extremely well as a unique type of speech query fingerprint.

The ID of the best answer corresponding to the selected final result record question is then generated by routine 815, which then returns it to DB Process shown in FIG. 4C. As seen there, a Best Answer ID is received by routine 716A, and used by a routine 716B to retrieve an answer file path. Routine 716C then opens and reads the answer file, and communicates the substance of the same to routine 716D. The latter then compresses the answer file data, and sends it over data link 160 to client side system 150 for processing as noted earlier (i.e., to be rendered into audible feedback, visual text/graphics, etc.). Again, in the context of a learning/instructional application, the answer file may consist solely of a single text phrase, but in other applications the substance and format will be tailored to a specific question in an appropriate fashion. For instance, an “answer” may consist of a list of multiple entries corresponding to a list of responsive category items (e.g., a list of books by a particular author) etc. Other variations will be apparent depending on the particular environment.
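The answer retrieval path through routines 716B, 716C and 716D (look up the file path, read the file, compress the contents) might look like the sketch below. The use of zlib is an assumption, as the specification does not name a compression scheme, and the function name and the dictionary of answer paths are hypothetical.

```python
import zlib
from pathlib import Path


def load_compressed_answer(answer_paths, answer_id):
    """Resolve the answer file for an ID, read it, and compress it.

    answer_paths -- mapping from Best Answer ID to the AnswerPath value
    """
    path = Path(answer_paths[answer_id])   # cf. 716B: determine the file path
    data = path.read_bytes()               # cf. 716C: open and read the file
    return zlib.compress(data)             # cf. 716D: compress for the data link
```

The client side would apply the matching zlib.decompress call before rendering the answer as text or speech.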

Natural Language Engine 190

Again referring to FIG. 4D, the general structure of NL engine 190 is depicted. This engine implements the word analysis or morphological analysis of words that make up the user's query, as well as phrase analysis of phrases extracted from the query.

As illustrated in FIG. 9, the functions used in a morphological analysis include tokenizers 802A, stemmers 804A and morphological analyzers 806A. The functions that comprise the phrase analysis include tokenizers, taggers and groupers, and their relationship is shown in FIG. 8.

Tokenizer 802A is a software module that functions to break up text of an input sentence 801A into a list of tokens 803A. In performing this function, tokenizer 802A goes through input text 801A and treats it as a series of tokens or useful meaningful units that are typically larger than individual characters, but smaller than phrases and sentences. These tokens 803A can include words, separable parts of words and punctuation. Each token 803A is given an offset and a length. The first phase of tokenization is segmentation, which extracts the individual tokens from the input text and keeps track of the offset where each token originated in the input text. Next, categories are associated with each token, based on its shape. The process of tokenization is well-known in the art, so it can be performed by any convenient application suitable for the present invention.
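A minimal tokenizer along these lines is sketched below: segmentation records each token's offset and length, and a shape-based category is then attached. The regular expression, the Token type and the four category names are assumptions chosen for illustration rather than the categories of any specific tokenizer.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    offset: int    # position of the token in the input text
    length: int
    category: str  # assigned from the token's shape


# Runs of letters, runs of digits, or a single other non-space character.
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")


def shape(text):
    """Classify a token by its surface shape (assumed category names)."""
    if text.isdigit():
        return "number"
    if text.isalpha():
        return "capitalized" if text[0].isupper() else "lowercase"
    return "punctuation"


def tokenize(sentence):
    return [
        Token(m.group(), m.start(), len(m.group()), shape(m.group()))
        for m in TOKEN_RE.finditer(sentence)
    ]
```

Because every token carries its offset and length, later stages can always map a recognized phrase back to its position in the original sentence.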

Following tokenization, a stemmer process 804A is executed, which can include two separate forms, inflectional and derivational, for analyzing the tokens to determine their respective stems 805A. An inflectional stemmer recognizes affixes and returns the word which is the stem. A derivational stemmer on the other hand recognizes derivational affixes and returns the root word or words. While stemmer 804A associates an input word with its stem, it does not have parts of speech information. Analyzer 806B takes a word independent of context, and returns a set of possible parts of speech 806A.

As illustrated in FIG. 8, phrase analysis 800 is the next step that is performed after tokenization. A tokenizer 802 generates tokens from input text 801. Tokens 803 are assigned part-of-speech tags by a tagger routine 804, and a grouper routine 806 recognizes groups of words as phrases of a certain syntactic type. These syntactic types include, for example, the noun phrases mentioned earlier, but could include other types if desired, such as verb phrases and adjective phrases. Specifically, tagger 804 is a parts-of-speech disambiguator, which analyzes words in context. It has a built-in morphological analyzer (not shown) that allows it to identify all possible parts of speech for each token. The output of tagger 804 is a string with each token tagged with a parts-of-speech label 805. The final step in the linguistic process 800 is the grouping of words to form phrases 807. This function is performed by the grouper 806, and is very dependent, of course, on the performance and output of tagger component 804.

Accordingly, at the end of linguistic processing 800, a list of noun phrases (NP) 807 is generated in accordance with the user's query utterance. This set of NPs generated by NLE 190 helps significantly to refine the search for the best answer, so that a single-best answer can be later provided for the user's question.
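The tagger and grouper stages can be illustrated with a deliberately simplified sketch. The lexicon, the tag names and the grouping rule (a run of adjectives and nouns that ends in a noun) are assumptions; in particular, the lookup-based tagger below does not disambiguate words in context as tagger 804 is described as doing.

```python
# Toy lexicon mapping lowercase words to a single part of speech (assumed).
LEXICON = {
    "the": "DET", "foreign": "ADJ", "american": "ADJ",
    "trade": "NOUN", "policy": "NOUN", "businesses": "NOUN",
    "affects": "VERB",
}


def tag(tokens):
    """Attach a part-of-speech label to each token by dictionary lookup."""
    return [(token, LEXICON.get(token.lower(), "OTHER")) for token in tokens]


def group_noun_phrases(tagged):
    """Group runs of adjectives/nouns that end in a noun into noun phrases."""
    phrases, current = [], []
    for word, pos in tagged + [("", "END")]:  # sentinel flushes the last run
        if pos in ("ADJ", "NOUN"):
            current.append((word, pos))
            continue
        while current and current[-1][1] != "NOUN":
            current.pop()  # a phrase must end with its Head noun
        if current:
            phrases.append(" ".join(w for w, _ in current))
        current = []
    return phrases
```

For the token list of "the foreign trade policy affects American businesses", this grouping yields the two noun phrases "foreign trade policy" and "American businesses"; it does not produce the noun sub-phrases that NLE 190 is described as returning.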

The particular components of NLE 190 are shown in FIG. 4D. Each of these components implements one of the several different functions required in NLE 190, as now explained.

Initialize Grouper Resources Object and the Library 900: this routine initializes the structure variables required to create the grouper resource object and library. Specifically, it initializes a particular natural language used by NLE 190 to create a Noun Phrase; for example, the English natural language is initialized for a system that serves the English language market. In turn, it also creates the objects (routines) required for Tokenizer, Tagger and Grouper (discussed above) with routines 900A, 900B, 900C and 900D respectively, and initializes these objects with appropriate values. It also allocates memory to store all the recognized Noun Phrases for the retrieved question pairs.

Tokenizing of the words from the given text (from the query or the paired questions) is performed with routine 909B; here all the words are tokenized with the help of a local dictionary used by NLE 190 resources. The resultant tokenized words are passed to a Tagger routine 909C. At routine 909C, tagging of all the tokens is done and the output is passed to a Grouper routine 909D.

The Grouping of all tagged tokens to form an NP list is implemented by routine 909D, so that the Grouper groups all the tagged token words and outputs the Noun Phrases.

Un-initializing of the grouper resources object and freeing of the resources is performed by routines 909EA, 909EB and 909EC. These cover Token Resources, Tagger Resources and Grouper Resources respectively. Once these objects are un-initialized, the resources are freed. The memory that was used to store all Noun Phrases is also de-allocated.

Additional Embodiments

In an e-commerce embodiment of the present invention as illustrated in FIG. 13, a web page 1300 contains typical visible links such as Books 1310 and Music 1320, so that on clicking the appropriate link the customer is taken to those pages. The web page may be implemented using HTML, a Java applet, or similar coding techniques which interact with the user's browser. For example, if a customer wants to buy an Album A by Artist Albert, he/she traverses several web pages as follows: he/she first clicks on Music (FIG. 13, 1360), which brings up page 1400 where he/she then clicks on Records (FIG. 14, 1450). Alternatively, he/she could select CDs 1460, Videos 1470, or other categories of books 1410, music 1420 or help 1430. As illustrated in FIG. 15, this brings up another web page 1500 with links for Records 1550, with sub-categories Artist 1560, Song 1570, Title 1580 and Genre 1590. The customer must then click on Artist 1560 to select the artist of choice. This displays another web page 1600 as illustrated in FIG. 16. On this page the various artists 1650 are listed as illustrated: Albert 1650, Brooks 1660, Charlie 1670 and Whyte 1690 appear under the category Artists 1650. The customer must now click on Albert 1660 to view the albums available for Albert. When this is done, another web page is displayed as shown in FIG. 17. Again this web page 1700 displays a similar look and feel, but with the albums available 1760, 1770, 1780 listed under the heading Titles 1750. The customer can also read additional information 1790 for each album. This album information is similar to the liner notes of a shrink-wrapped album purchased at a retail store. Once Album A is identified, the customer must click on the Album A 1760. This typically brings up another text box with the information about its availability, price, shipping and handling charges etc.

When web page 1300 is provided with functionality of a NLQS of the type described above, the web page interacts with the client side and server side speech recognition modules described above. In this case, the user initiates an inquiry by simply clicking on a button designated Contact Me for Help 1480 (this can be a link button on the screen, or a key on the keyboard for example) and is then told by character 1440 about how to elicit the information required. If the user wants Album A by artist Albert, the user could articulate “Is Album A by Albert available?” in much the same way they would ask the question of a human clerk at a brick and mortar facility. Because of the rapid recognition performance of the present invention, the user's query would be answered in real-time by character 1440 speaking out the answer in the user's native language. If desired, a readable word balloon 1490 could also be displayed to show the character's answer, so that save/print options can also be implemented. Similar appropriate question/answer pairs for each page of the website can be constructed in accordance with the present teachings, so that the customer is provided with an environment that emulates a normal conversational human-like question and answer dialog for all aspects of the web site. Character 1440 can be adjusted and tailored according to the particular commercial application, or by the user's own preferences, etc. to have a particular voice style (man, woman, young, old, etc.) to enhance the customer's experience.

In a similar fashion, an articulated user query might be received as part of a conventional search engine query, to locate information of interest on the INTERNET in a similar manner as done with conventional text queries. If a reasonably close question/answer pair is not available at the server side (for instance, if it does not reach a certain confidence level as an appropriate match to the user's question) the user could be presented with the option of increasing the scope so that the query would then be presented simultaneously to one or more different NLEs across a number of servers, to improve the likelihood of finding an appropriate matching question/answer pair. Furthermore, if desired, more than one “match” could be found, in the same fashion that conventional search engines can return a number of potential “hits” corresponding to the user's query. For some such queries, of course, it is likely that real-time performance will not be possible (because of the disseminated and distributed processing) but the advantage presented by extensive supplemental question/answer database systems may be desirable for some users.

It is apparent as well that the NLQS of the present invention is very natural and saves much time for the user and the e-commerce operator as well. In an e-support embodiment, the customer can retrieve information quickly and efficiently, and without need for a live customer agent. For example, at a consumer computer system vendor related support site, a simple diagnostic page might be presented for the user, along with a visible support character to assist him/her. The user could then select items from a “symptoms” page (e.g., a “monitor” problem, a “keyboard” problem, a “printer” problem, etc.) simply by articulating such symptoms in response to prompting from the support character. Thereafter, the system will direct the user on a real-time basis to more specific sub-menus, potential solutions, etc. for the particular recognized complaint. The use of a programmable character thus allows the web site to be scaled to accommodate a large number of hits or customers without any corresponding need to increase the number of human resources and their attendant training issues.

As an additional embodiment, the searching for information on a particular web site may be accelerated with the use of the NLQS of the present invention. Additionally, a significant benefit is that the information is provided in a user-friendly manner through the natural interface of speech. The majority of web sites presently employ lists of frequently asked questions which the user typically wades through item by item in order to obtain an answer to a question or issue. For example, as displayed in FIG. 13, the customer clicks on Help 1330 to initiate the interface with a set of lists. Other options include computer related items at 1370 and frequently asked questions (FAQ) at 1380.

As illustrated in FIG. 18, a web site plan for a typical web page is displayed. This illustrates the number of pages that have to be traversed in order to reach the list of Frequently-Asked Questions. Once at this page, the user has to scroll and manually identify the question that matches his/her query. This process is typically a laborious task and may or may not yield the information that answers the user's query. The present art for displaying this information is illustrated in FIG. 18. This figure identifies how the information on a typical web site is organized: the Help link (FIG. 13, 1330) typically shown on the home page of the web site is shown in FIG. 18 as 1800. Again referring to FIG. 18, each sub-category of information is listed on a separate page. For example, 1810 lists sub-topics such as ‘First Time Visitors’, ‘Search Tips’, ‘Ordering’, ‘Shipping’, ‘Your Account’ etc. Other pages deal with ‘Account information’ 1860, ‘Rates and Policies’ 1850 etc. Down another level, there are pages that deal exclusively with sub-sub topics on a specific page such as ‘First Time Visitors’ 1960, ‘Frequently Asked Questions’ 1950, ‘Safe Shopping Guarantee’ 1940, etc. So if a customer has a query that is best answered by going to the Frequently Asked Questions link, he or she has to traverse three levels of busy and cluttered screen pages to get to the Frequently Asked Questions page 1950. Typically, there are many lists of questions 1980 that have to be manually scrolled through. While scrolling, the customer then has to visually and mentally match his or her question with each listed question. If a possible match is sighted, then that question is clicked and the answer then appears in text form, which is then read.

In contrast, the process of obtaining an answer to a question using a web page enabled with the present NLQS can be achieved much less laboriously and more efficiently. The user would articulate the word “Help” (FIG. 13, 1330). This would immediately cause a character (FIG. 13, 1340) to appear with the friendly response “May I be of assistance. Please state your question?”. Once the customer states the question, the character would then perform an animation or reply “Thank you, I will be back with the answer soon”. After a short period of time (preferably not exceeding 5-7 seconds) the character would then speak out the answer to the user's question. As illustrated in FIG. 18, the answer 1990 returned to the user in the form of speech is the answer that is paired with the question 1950. For example, the answer 1990: “We accept Visa, MasterCard and Discover credit cards”, would be the response to the query 2000 “What forms of payments do you accept?”
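
The pairing of a spoken query with a stored question and its answer might be sketched as follows. This is a simplified stand-in, not the matching performed by the NLQS: it scores stored questions by plain word overlap (Jaccard similarity) against the recognized text. The names `FAQ_PAIRS`, `words`, and `best_answer` are hypothetical, and the second question/answer pair is placeholder data added only so that there is something to choose between.

```python
import re
from typing import Dict, Optional, Tuple

# Question/answer pairs. The first pair is the example from the text;
# the second is hypothetical placeholder data for illustration only.
FAQ_PAIRS: Dict[str, str] = {
    "What forms of payments do you accept?":
        "We accept Visa, MasterCard and Discover credit cards",
    "How long does shipping take?":
        "Placeholder answer about shipping.",
}


def words(text: str) -> set:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def best_answer(query: str, pairs: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return the (question, answer) pair whose question best overlaps the query."""
    query_words = words(query)

    def score(question: str) -> float:
        question_words = words(question)
        union = query_words | question_words
        # Jaccard similarity: shared words divided by all distinct words.
        return len(query_words & question_words) / len(union) if union else 0.0

    best = max(pairs, key=score)
    # A score of zero means no stored question shares any word with the query.
    return (best, pairs[best]) if score(best) > 0 else None
```

With this sketch, `best_answer("Which payment forms do you accept", FAQ_PAIRS)` selects the payment question, and the paired answer is what the character would speak back to the user.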

Another embodiment of the invention is illustrated in FIG. 12. This web page illustrates a typical website that employs NLQS in a web-based learning environment. As illustrated in FIG. 12, the web page in browser 1200 is divided into two or more frames. A character 1210 in the likeness of an instructor is available on the screen and appears when the student initiates the query mode either by speaking the word “Help” into a microphone (FIG. 2B, 215) or by clicking on the link ‘Click to Speak’ (FIG. 12, 1280). Character 1210 would then prompt the student to select a course 1220 from the drop down list 1230. If the user selects the course ‘CPlusPlus’, the character would then confirm verbally that the course “CPlusPlus” was selected. The character would then direct the student to make the next selection from the drop-down list 1250 that contains the selections for the chapters 1240 from which questions are available. Again, after the student makes the selection, the character 1210 confirms the selection by speaking. Next character 1210 prompts the student to select ‘Section’ 1260 of the chapter from which questions are available from the drop down list 1270. Again, after the student makes the selection, character 1210 confirms the selection by articulating the ‘Section’ 1260 chosen. As a prompt to the student, a list of possible questions appears in the list box 1291. In addition, tips 1290 for using the system are displayed. Once the selections are all made, the student is prompted by the character to ask the question as follows: “Please ask your query now”. The student then speaks his query and after a short period of time, the character responds with the answer preceded by the question as follows: “The answer to your question . . . is as follows: . . . ”. This procedure allows the student to quickly retrieve answers to questions about any section of the course and replaces the tedium of consulting books, and references or indices. In short, it can serve a number of uses, from a virtual teacher answering questions on-the-fly to a flash card substitute.
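
The course, chapter and section selections described above narrow the set of question/answer pairs before the student speaks. A minimal sketch of that scoping follows, assuming a hypothetical in-memory `CATALOG` keyed by the three selections. The chapter and section labels, the stored question and answer, the wording of the confirmation, and all function names are illustrative assumptions rather than content from the specification; only the course name ‘CPlusPlus’ and the response template come from the text.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical catalog: (course, chapter, section) -> {question: answer}.
# All entries below are placeholder data for illustration only.
CATALOG: Dict[Tuple[str, str, str], Dict[str, str]] = {
    ("CPlusPlus", "Chapter 1", "Section 1"): {
        "What is a class?": "Placeholder answer about classes.",
    },
}


def confirm(kind: str, value: str) -> str:
    """Text the character might speak to confirm a selection (assumed wording)."""
    return f"You selected {kind} {value}."


def suggested_questions(course: str, chapter: str, section: str) -> List[str]:
    """Questions shown in the list box for the current selections."""
    return sorted(CATALOG.get((course, chapter, section), {}))


def respond(course: str, chapter: str, section: str,
            question: str) -> Optional[str]:
    """Answer preceded by the question, following the template in the text."""
    answer = CATALOG.get((course, chapter, section), {}).get(question)
    if answer is None:
        return None  # no stored pair for this question in the selected section
    return f"The answer to your question {question} is as follows: {answer}"
```

Restricting the lookup to one section keeps the candidate set small, which is one plausible way to stay within the number of question/answer pairs that can be handled with a real-time feel.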

From preliminary data available to the inventors, it is estimated that the system can easily accommodate 100-250 question/answer pairs while still achieving a real-time feel and appearance to the user (i.e., less than 10 seconds of latency, not counting transmission) using the above described structures and methods. It is expected, of course, that these figures will improve as additional processing speed becomes available, and routine optimizations are employed to the various components noted for each particular environment.

Again, the above are merely illustrative of the many possible applications of the present invention, and it is expected that many more web-based enterprises, as well as other consumer applications (such as intelligent, interactive toys) can utilize the present teachings. Although the present invention has been described in terms of a preferred embodiment, it will be apparent to those skilled in the art that many alterations and modifications may be made to such embodiments without departing from the teachings of the present invention. It will also be apparent to those skilled in the art that many aspects of the present discussion have been simplified to give appropriate weight and focus to the more germane aspects of the present invention. The microcode and software routines executed to effectuate the inventive methods may be embodied in various forms, including in permanent magnetic media, a non-volatile ROM, a CD-ROM, or any other suitable machine-readable format. Accordingly, it is intended that all such alterations and modifications be included within the scope and spirit of the invention as defined by the following claims.

1. A method of performing recognition of a speech utterance from a user with a distributed client-server system comprising the steps of: (a) receiving user speech data from a client device in streaming packets through a network interface of a network server system, said speech data resulting from a first set of speech recognition operations being performed on the speech utterance by a client device; (b) recognizing the speech utterance as well as a natural language used in said speech utterance using processing routines executing at said network server system which implement a second set of speech recognition operations; (c) providing a response to the user in a same natural language as recognized in step (b).
2. The method of claim 1, wherein said response is an audible response from an electronic agent generated by a text to speech engine.
3. The method of claim 1, wherein an interactive electronic agent provides said response, which interactive electronic agent exhibits characteristics that are adjusted by said processing routines executing at said network server system based on a type of application interacting with the user.
4. The method of claim 1, wherein an interactive electronic agent provides said response, which interactive electronic agent exhibits characteristics that are adjusted by said processing routines executing at the network server system based on an identity of the user.
5. The method of claim 1, wherein an interactive electronic agent presented within a browser or graphical interface of the client system provides said response, which interactive electronic agent responds to user queries presented in speech form and assists the user to navigate and select items from an Internet web page.
6. The method of claim 2, wherein said interactive electronic agent further provides one or more specific suggested queries to the user.
7. The method of claim 2, wherein said interactive electronic agent further provides audible confirmation of selections made by the user.
8. The method of claim 1, wherein a plurality of Hidden Markov Models are used to recognize said speech utterance and said natural language of said speech utterance such that the system supports multiple natural languages.
9. The method of claim 1, further including a step: performing a natural language processing operation on said speech utterance to determine a meaning of a query presented by the user.
10. The method of claim 9, further including a step: forming a database query based on identifying said query presented by the user to retrieve a predetermined answer for said query.
11. The method of claim 1, wherein said processing routines at the network server system use one or more speech recognition models that are trained and optimized based on speech characteristics of a group of persons residing in geographical regions served by the distributed client-server system.
12. The method of claim 1, wherein said client device is a portable Internet based appliance.
13. The method of claim 1, further including a step: adjusting said second set of speech recognition operations based on an evaluation of resources available at the network server system and/or the client device.
14. The method of claim 13, further including a step: adjusting said first set of speech recognition operations based on an evaluation of resources available at the client device.
15. The method of claim 1, wherein said processing routines include software programs executable on said network server system.
16. A method of performing recognition of a speech utterance from a user with a distributed client-server system comprising the steps of: (a) receiving user speech data from a client device through a network interface of a network server system, said speech data constituting partially recognized speech derived by a client device from a speech utterance; (b) completing recognition of the speech utterance and identifying a language therein using software routines executing at said network server system and a plurality of speech models associated with a plurality of languages; (c) processing the speech utterance with one or more natural language operations to identify a meaning of the speech utterance; (d) identifying a query presented by the user based on said meaning of the speech utterance; (e) providing a response to the query in a same language as recognized in step (b).
17. The method of claim 16, further including a step: calibrating noise data present at the client device.
18. The method of claim 16, wherein said response is an audible response from an electronic agent generated by a text to speech engine.
19. The method of claim 18, further including a step: conducting an interactive dialog with the user using said electronic agent in response to further speech utterances.
20. A system for recognizing a speech utterance from a user comprising: (a) a first processing routine adapted to receive user speech data from a client device in streaming packets through a network interface of a network server system, said speech data resulting from a first set of speech recognition operations being performed on the speech utterance by the client device; (b) a second processing routine adapted to recognize the speech utterance as well as a natural language used in said speech utterance by executing a second set of speech recognition operations; (c) a third processing routine for providing a response to the user in a same natural language as recognized by said second software routine.
21. The system of claim 20, wherein a number of speech recognition operations performed on the network server system can be adjusted.
22. The system of claim 20, wherein said response is an audible response from an electronic agent generated by a text to speech engine.
23. The system of claim 20 wherein an interactive electronic agent provides said response, which electronic agent exhibits characteristics that are adjusted by said third processing routine based on a type of application interacting with the user.
24. The system of claim 20, wherein an interactive electronic agent presented within a browser of the client system provides said response, which interactive electronic agent responds to user queries presented in speech form and assists the user to navigate and select items from an Internet web page.
25. The system of claim 22, wherein said electronic agent further provides one or more specific suggested queries to the user.
26. The system of claim 22, wherein said interactive electronic agent further provides audible confirmation of selections made by the user.
27. The system of claim 20 wherein a plurality of Hidden Markov Models are used to recognize said speech utterance and said language of said speech utterance such that multiple natural languages are supported.
28. The system of claim 20, further including a natural language processing routine which operates on said speech utterance to determine a meaning of a query presented by the user.
29. The system of claim 28, further including a database interface which generates a database query based on identifying said query presented by the user to retrieve a predetermined answer for said query.
30. The system of claim 20, wherein said second processing routine uses one or more speech recognition models that are trained and optimized based on speech characteristics of a group of persons residing in geographical regions served by the distributed client-server system.
31. The system of claim 20, wherein said client device is a portable Internet based appliance.
32. The system of claim 19, further including a routine which adjusts said second set of speech recognition operations based on an evaluation of resources available at the network server system.
33. The system of claim 32, further including a routine which adjusts said first set of speech recognition operations based on an evaluation of resources available at the client device.
34. The system of claim 20, wherein speech recognition operations can be allocated between the client device and the network server system on a query-by-query basis.
35. The system of claim 20, wherein the network server system is a group of interlinked computing servers.
36. The system of claim 20, wherein said processing routines include software programs executable on said network server system.
37. A system for recognizing a speech utterance from a user comprising: (a) a first routine adapted to receive user speech data from a client device through a network interface of a network server system, said speech data constituting partially recognized speech derived by a client device from a speech utterance; (b) a second routine executing at said network server system which is adapted to complete recognition of the speech utterance and to identify a natural language therein using a plurality of speech models associated with a plurality of natural languages; (c) a third routine adapted to process the speech utterance with one or more natural language operations to identify a meaning of the speech utterance; (d) a fourth routine adapted to identify a query presented by the user based on said meaning of the speech utterance; (e) a fifth routine adapted to provide a response to the query in a same natural language as recognized in step (b).
38. The system of claim 37, wherein all of said routines are implemented as executable software programs on said network server system.
39. The system of claim 37, wherein said fifth routine interacts with a second user using a second natural language identified by said second routine.
40. The system of claim 37 wherein said speech data is received continuously during a speech utterance.
41. The system of claim 40 wherein said speech data is received continuously until silence is detected.